tooluniverse-multi-omics-integration
Multi-Omics Integration
Coordinate and integrate multiple omics datasets for comprehensive systems biology analysis. This skill orchestrates specialized ToolUniverse skills to perform cross-omics correlation, multi-omics clustering, pathway-level integration, and unified interpretation across molecular layers.
When to Use This Skill
Triggers:
- User has multiple omics datasets (RNA-seq + proteomics, methylation + expression, etc.)
- Requests for integrative multi-omics analysis
- Cross-omics correlation queries (e.g., "How does methylation affect expression?")
- Multi-omics biomarker discovery
- Systems biology questions requiring multiple molecular layers
- Precision medicine applications with multi-omics patient data
- Questions about molecular mechanisms across omics types
Example Questions This Skill Solves:
- "Integrate RNA-seq and proteomics data to find genes with concordant changes"
- "How does promoter methylation correlate with gene expression?"
- "Perform multi-omics clustering to identify patient subtypes"
- "Which pathways are dysregulated across transcriptome, proteome, and metabolome?"
- "Find multi-omics biomarkers for disease classification"
- "Correlate CNV with gene expression to identify dosage effects"
- "Integrate GWAS variants, eQTLs, and expression data"
- "Perform MOFA+ analysis on multi-omics cancer data"
Core Capabilities
| Capability | Description |
|---|---|
| Data Integration | Match samples across omics, handle missing data, normalize scales |
| Cross-Omics Correlation | Correlate features across molecular layers (gene expression vs protein, methylation vs expression) |
| Multi-Omics Clustering | MOFA+, NMF, joint clustering to identify omics-driven subtypes |
| Pathway Integration | Combine omics evidence at pathway level for unified biological interpretation |
| Biomarker Discovery | Identify multi-omics signatures with improved predictive power |
| Skill Coordination | Orchestrate RNA-seq, epigenomics, variant-analysis, protein-interactions, gene-enrichment skills |
| Visualization | Circos plots, integrated heatmaps, network visualizations |
| Reporting | Unified multi-omics reports with cross-layer insights |
Workflow Overview
Input: Multiple Omics Datasets
|
v
Phase 1: Data Loading & QC
|-- Load RNA-seq (expression matrix)
|-- Load proteomics (protein abundance)
|-- Load methylation (beta values or M-values)
|-- Load variants (CNV, SNV from VCF)
|-- Load metabolomics (metabolite abundance)
|-- Quality control per omics type
|
v
Phase 2: Sample Matching
|-- Match samples across omics by ID
|-- Identify common samples
|-- Handle batch effects
|-- Normalize sample identifiers
|
v
Phase 3: Feature Mapping
|-- Map features to common identifier space (genes, proteins, metabolites)
|-- Link CpG sites to genes (promoter, gene body)
|-- Map variants to genes
|-- Create unified feature matrix
|
v
Phase 4: Cross-Omics Correlation
|-- Gene expression vs protein abundance (translation efficiency)
|-- Promoter methylation vs expression (epigenetic regulation)
|-- CNV vs expression (dosage effect)
|-- eQTL variants vs expression (genetic regulation)
|-- Metabolite vs enzyme expression (metabolic flux)
|
v
Phase 5: Multi-Omics Clustering
|-- MOFA+ (Multi-Omics Factor Analysis) for latent factors
|-- NMF (Non-negative Matrix Factorization) for patient subtypes
|-- Joint clustering across omics
|-- Identify omics-specific vs shared variation
|
v
Phase 6: Pathway-Level Integration
|-- Aggregate omics to pathway level
|-- Score pathway dysregulation (combined evidence)
|-- Use ToolUniverse enrichment tools (Reactome, KEGG, GO)
|-- Identify driver pathways across omics
|
v
Phase 7: Biomarker Discovery
|-- Feature selection across omics
|-- Multi-omics signatures for classification
|-- Cross-validation and performance
|-- Interpretation and biological validation
|
v
Phase 8: Generate Integrated Report
|-- Summary statistics per omics
|-- Cross-omics correlation results
|-- Multi-omics clusters and subtypes
|-- Top dysregulated pathways
|-- Multi-omics biomarkers
|-- Biological interpretation
Phase Details
Phase 1: Data Loading & Quality Control
Objective: Load multiple omics datasets and perform quality control.
Supported omics types:
- Transcriptomics: RNA-seq count matrices, microarray
- Proteomics: Protein abundance (MS-based)
- Epigenomics: Methylation (450K, EPIC arrays, WGBS), ChIP-seq peaks
- Genomics: CNV, SNV, structural variants
- Metabolomics: Metabolite abundance (targeted, untargeted)
Data formats:
- Expression: CSV/TSV matrices, HDF5, AnnData (.h5ad)
- Proteomics: MaxQuant output, Spectronaut, DIA-NN
- Methylation: IDAT files, beta value matrices
- Variants: VCF, SEG files (CNV)
- Metabolomics: Peak tables, identified metabolites
Quality control per omics:
RNA-seq QC
- Filter low-count genes (mean counts < threshold)
- Normalize (TPM, FPKM, or DESeq2)
- Log-transform for correlation
Proteomics QC
- Filter proteins with high missing values
- Impute missing values (minimum, KNN)
- Normalize (median, quantile)
Methylation QC
- Remove failed probes
- Correct for batch effects (ComBat)
- Filter cross-reactive probes
Variants QC
- Use variant-analysis skill for VCF QC
- CNV segmentation validation
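The RNA-seq and proteomics steps above can be sketched with pandas and NumPy. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the function names `qc_rnaseq` and `qc_proteomics`, the default thresholds, the use of counts-per-million normalization, and per-protein minimum imputation are all choices made for the example.

```python
import numpy as np
import pandas as pd


def qc_rnaseq(counts: pd.DataFrame, min_mean_count: float = 10.0) -> pd.DataFrame:
    """Filter low-count genes, normalize to CPM, and log-transform.

    counts: genes x samples matrix of raw read counts.
    The threshold and CPM normalization are illustrative choices.
    """
    # Keep genes whose mean count across samples reaches the threshold
    kept = counts.loc[counts.mean(axis=1) >= min_mean_count]
    # Counts per million, using each sample's library size after filtering
    cpm = kept / kept.sum(axis=0) * 1e6
    # log2(CPM + 1) compresses the dynamic range before correlation analysis
    return np.log2(cpm + 1)


def qc_proteomics(abundance: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop proteins with too many missing values, impute, and median-normalize.

    abundance: proteins x samples matrix of log-scale intensities (NaN = missing).
    """
    # Remove proteins missing in more than max_missing of the samples
    kept = abundance.loc[abundance.isna().mean(axis=1) <= max_missing]
    # Minimum imputation: fill each protein's gaps with its lowest observed value
    imputed = kept.apply(lambda row: row.fillna(row.min()), axis=1)
    # Median normalization: subtract each sample's median (log scale)
    return imputed - imputed.median(axis=0)
```

Both functions return matrices in the same features x samples orientation as their inputs, so the results can be passed directly to the sample-matching step.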
Phase 2: Sample Matching
Objective: Identify common samples across omics datasets.
Sample ID harmonization:
```python
def match_samples_across_omics(omics_data_dict):
    """
    Match samples across multiple omics datasets.

    Parameters:
        omics_data_dict: {
            'rnaseq': DataFrame (genes x samples),
            'proteomics': DataFrame (proteins x samples),
            'methylation': DataFrame (CpGs x samples),
            'cnv': DataFrame (genes x samples)
        }

    Returns:
        - common_samples: List of sample IDs present in all omics
        - matched_data: Dict of DataFrames with common samples only
    """
    # Extract sample IDs from each omics
    sample_ids = {
        omics_type: set(df.columns)
        for omics_type, df in omics_data_dict.items()
    }
    # Find common samples (intersection), sorted for a stable column order
    common_samples = sorted(set.intersection(*sample_ids.values()))
    # Subset each omics to common samples
    matched_data = {
        omics_type: df[common_samples]
        for omics_type, df in omics_data_dict.items()
    }
    return common_samples, matched_data
```
Handling missing omics:
- Pairwise integration if not all samples have all omics
- Document sample availability matrix
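The sample availability matrix mentioned above can be built as a boolean table of samples by omics layer. The sketch below assumes the same `omics_data_dict` layout as `match_samples_across_omics` (features x samples); `sample_availability_matrix` and `pairwise_common_samples` are hypothetical helper names introduced for illustration.

```python
from itertools import combinations

import pandas as pd


def sample_availability_matrix(omics_data_dict):
    """Return a samples x omics boolean table: True where a sample was profiled."""
    # Union of all sample IDs seen in any omics layer
    all_samples = sorted(set().union(*(df.columns for df in omics_data_dict.values())))
    sample_index = pd.Index(all_samples)
    return pd.DataFrame(
        {name: sample_index.isin(df.columns) for name, df in omics_data_dict.items()},
        index=all_samples,
    )


def pairwise_common_samples(availability):
    """List the samples shared by each pair of omics layers (for pairwise integration)."""
    return {
        (a, b): availability.index[(availability[a] & availability[b]).to_numpy()].tolist()
        for a, b in combinations(availability.columns, 2)
    }
```

Summing the table by column gives the number of samples per layer, and summing by row shows how many layers each sample has, which helps decide between full and pairwise integration.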
Phase 3: Feature Mapping
Objective: Map features from different omics to common gene-level identifiers.
Gene-centric integration:
```python
# Map all features to genes
feature_mapping = {
    'rnaseq': 'gene_symbol',       # Already gene-level
    'proteomics': 'gene_symbol',   # Map protein to gene
    'methylation': 'gene_symbol',  # Map CpG to gene (promoter)
    'cnv': 'gene_symbol',          # CNV regions to overlapping genes
    'metabolomics': 'enzyme_gene'  # Metabolite to enzyme gene
}
```
**CpG to gene mapping**:
- **Promoter methylation**: CpGs within TSS ± 2kb
- **Gene body methylation**: CpGs within gene boundaries
- Average methylation per gene (weighted by probe coverage)
**CNV to gene mapping**:
- Use variant-analysis skill to identify genes in CNV regions
- Calculate copy number per gene (log2 ratio)
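The promoter rule above (CpGs within TSS ± 2kb, averaged per gene) can be sketched as follows. The annotation table and its column names (`probe_id`, `gene_symbol`, `distance_to_tss`) are assumptions for this example; real array manifests use different column names. The sketch also uses an unweighted mean, which is a simplification of the coverage-weighted average described above.

```python
import pandas as pd


def promoter_methylation_per_gene(beta, annotation, window=2000):
    """Average promoter CpG beta values per gene.

    beta: CpGs x samples DataFrame of beta values, indexed by probe ID.
    annotation: DataFrame with hypothetical columns 'probe_id', 'gene_symbol',
        and 'distance_to_tss' (signed distance in bp).
    window: half-width of the promoter window around the TSS, in bp.
    """
    # Keep probes that fall inside TSS +/- window and are present in the data
    in_promoter = annotation[annotation["distance_to_tss"].abs() <= window]
    in_promoter = in_promoter[in_promoter["probe_id"].isin(beta.index)]
    # Probe ID -> gene symbol lookup for the retained probes
    probe_to_gene = in_promoter.set_index("probe_id")["gene_symbol"]
    promoter_beta = beta.loc[probe_to_gene.index]
    # Unweighted mean per gene (simplification of coverage-weighted averaging)
    return promoter_beta.groupby(probe_to_gene.to_numpy()).mean()
```

The result is a genes x samples matrix, which puts methylation in the same identifier space as the expression data.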
Phase 4: Cross-Omics Correlation
Objective: Correlate features across molecular layers to understand regulation.
Example analyses:
4.1: Expression vs Protein (Translation Efficiency)
```python
from scipy.stats import spearmanr


def correlate_rna_protein(rnaseq_data, proteomics_data):
    """
    Correlate mRNA and protein levels for each gene.

    Both inputs are genes x samples DataFrames with the same samples in the
    same column order.
    Expected: Positive correlation (r ~ 0.4-0.6 typical)
    Discordance may indicate post-transcriptional regulation
    """
    # Find common genes
    common_genes = set(rnaseq_data.index) & set(proteomics_data.index)
    correlations = {}
    for gene in common_genes:
        rna = rnaseq_data.loc[gene]
        protein = proteomics_data.loc[gene]
        # Spearman correlation (robust to outliers)
        r, p = spearmanr(rna, protein)
        correlations[gene] = {'r': r, 'p': p}
    # Identify discordant genes (low RNA-protein correlation)
    discordant = {g: v for g, v in correlations.items() if abs(v['r']) < 0.2}
    return correlations, discordant
```
4.2: Methylation vs Expression (Epigenetic Regulation)
```python
from scipy.stats import spearmanr


def correlate_methylation_expression(methylation_data, rnaseq_data):
    """
    Correlate promoter methylation with gene expression.

    Both inputs are genes x samples DataFrames with the same samples in the
    same column order.
    Expected: Negative correlation (increased methylation → decreased expression)
    """
    # For each gene with promoter methylation
    results = {}
    for gene in methylation_data.index:
        if gene in rnaseq_data.index:
            meth = methylation_data.loc[gene]  # Average promoter beta
            expr = rnaseq_data.loc[gene]
            r, p = spearmanr(meth, expr)
            results[gene] = {'r': r, 'p': p, 'direction': 'repressive' if r < 0 else 'activating'}
    # Identify genes with strong methylation-expression anticorrelation
    regulated = {g: v for g, v in results.items() if v['r'] < -0.5 and v['p'] < 0.01}
    return results, regulated
```
4.3: CNV vs Expression (Dosage Effect)
```python
from scipy.stats import pearsonr


def correlate_cnv_expression(cnv_data, rnaseq_data):
    """
    Correlate copy number with gene expression.

    Both inputs are genes x samples DataFrames with the same samples in the
    same column order.
    Expected: Positive correlation (gene dosage effect)
    """
    results = {}
    for gene in cnv_data.index:
        if gene in rnaseq_data.index:
            cnv = cnv_data.loc[gene]  # log2 ratio
            expr = rnaseq_data.loc[gene]
            r, p = pearsonr(cnv, expr)
            results[gene] = {'r': r, 'p': p}
    # Genes with dosage effect (CNV drives expression)
    dosage_genes = {g: v for g, v in results.items() if v['r'] > 0.5 and v['p'] < 0.01}
    return results, dosage_genes
```
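The correlation functions above apply a raw `p < 0.01` cutoff to each gene, and with thousands of genes tested some of those hits will be false positives. One common remedy is a Benjamini-Hochberg false discovery rate adjustment. The sketch below is an optional addition rather than part of the original workflow; `benjamini_hochberg` and `add_q_values` are hypothetical helper names, and genes with undefined (NaN) p-values are assumed to have been dropped beforehand.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1
    q_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    q = np.empty(n)
    q[order] = q_sorted
    return q


def add_q_values(results):
    """Attach a 'q' field to each gene in a {gene: {'r': ..., 'p': ...}} dict."""
    genes = list(results)
    q = benjamini_hochberg([results[g]["p"] for g in genes])
    return {g: {**results[g], "q": float(qv)} for g, qv in zip(genes, q)}
```

With q-values attached, a filter such as `v['q'] < 0.05` can replace the raw p-value cutoff when selecting regulated or dosage-sensitive genes.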
Phase 5: Multi-Omics Clustering
Objective: Identify patient subtypes using integrated omics data.
Method 1: MOFA+ (Multi-Omics Factor Analysis)
MOFA+ identifies latent factors that explain variation across omics.
Conceptual workflow (uses R's MOFA2 package or a Python implementation):
1. Prepare multi-omics data as list of matrices
2. Run MOFA+ to identify factors
3. Inspect factor variance explained per omics
4. Cluster samples based on factor scores
Example interpretation:
- Factor 1: Explains 40% variance in RNA-seq, 30% in proteomics → Cell proliferation
- Factor 2: Explains 50% variance in methylation → Epigenetic subtype
- Factor 3: Explains 20% variance in CNV → Genomic instability
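Step 4 of the workflow above (clustering samples on factor scores) can be sketched without running MOFA+ itself. The example assumes the factor scores have already been exported as a samples x factors table; the function name `cluster_on_factors` and the choice of k-means are illustrative assumptions, not something MOFA+ prescribes.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_on_factors(factors, n_clusters=3, random_state=42):
    """Assign samples to clusters from a samples x factors score table.

    factors: DataFrame indexed by sample ID, one column per latent factor.
    """
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(factors.to_numpy())
    # Return labels keyed by sample ID so they can be joined to clinical data
    return pd.Series(labels, index=factors.index, name="cluster")
```

Cluster labels are arbitrary integers, so subtype names should come from inspecting which factors separate the groups.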
Method 2: Joint NMF (Non-negative Matrix Factorization)
Decompose multi-omics matrices into shared latent components.
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF


def joint_nmf_clustering(omics_data_dict, n_clusters=3):
    """
    Perform joint NMF across omics for clustering.
    Returns patient cluster assignments based on shared factors.

    Each matrix is features x samples, restricted to the same samples in the
    same column order, and must be non-negative (an NMF requirement).
    """
    # Stack omics matrices along the feature axis (after normalization)
    combined_matrix = np.vstack([
        omics_data_dict['rnaseq'].values,
        omics_data_dict['proteomics'].values,
        omics_data_dict['methylation'].values
    ])
    # Run NMF
    model = NMF(n_components=n_clusters, init='nndsvd', random_state=42)
    W = model.fit_transform(combined_matrix)  # Feature loadings (features x components)
    H = model.components_                     # Sample coefficients (components x samples)
    # Cluster samples based on H (one row of H.T per sample)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(H.T)
    return clusters, W, H
```
Method 3: Similarity Network Fusion (SNF)
Integrate omics through patient similarity networks.
Phase 6: Pathway-Level Integration
Objective: Aggregate multi-omics evidence at the pathway level.
Approach: Score pathway dysregulation using combined evidence from multiple omics.
```python
import numpy as np


def integrate_pathway_evidence(omics_results, pathway_genes):
    """
    Score pathway dysregulation across omics.

    omics_results: {
        'rnaseq': {'gene': fold_change},
        'proteomics': {'gene': fold_change},
        'methylation': {'gene': methylation_diff},
        'cnv': {'gene': copy_number}
    }
    pathway_genes: List of genes in pathway

    Values are averaged as given, so they should be on comparable scales.
    """
    omics_layers = ('rnaseq', 'proteomics', 'methylation', 'cnv')
    pathway_scores = []
    omics_with_evidence = set()
    # For each gene in pathway
    for gene in pathway_genes:
        gene_score = 0
        evidence_count = 0
        for layer in omics_layers:
            # Absolute effect size, so changes in either direction count
            if gene in omics_results.get(layer, {}):
                gene_score += abs(omics_results[layer][gene])
                evidence_count += 1
                omics_with_evidence.add(layer)
        if evidence_count > 0:
            pathway_scores.append(gene_score / evidence_count)
    # Aggregate pathway score (mean of gene scores)
    pathway_score = np.mean(pathway_scores) if pathway_scores else 0
    return {
        'pathway_score': pathway_score,
        'n_genes_with_evidence': len(pathway_scores),
        # Number of omics layers contributing evidence for any pathway gene
        'n_omics_types': len(omics_with_evidence)
    }
```
Use ToolUniverse enrichment tools:
```python
# Get pathways for gene set
from tooluniverse import ToolUniverse

tu = ToolUniverse()

# Enrichment for genes dysregulated in ANY omics.
# rnaseq_degs, diff_proteins, and methylation_dmgs are the gene lists from the
# per-omics differential analyses; omics_results is the per-omics effect dict
# expected by integrate_pathway_evidence.
all_dysregulated_genes = set()
all_dysregulated_genes.update(rnaseq_degs)
all_dysregulated_genes.update(diff_proteins)
all_dysregulated_genes.update(methylation_dmgs)

# Run enrichment
enrichment = tu.run_one_function({
    "name": "enrichr_enrich",
    "arguments": {
        "gene_list": ",".join(sorted(all_dysregulated_genes)),
        "library": "KEGG_2021_Human"
    }
})

# Score each pathway with multi-omics evidence
for pathway in enrichment['data']['results']:
    pathway_genes = pathway['genes']
    pathway['multi_omics_score'] = integrate_pathway_evidence(
        omics_results, pathway_genes
    )
```
Phase 7: Biomarker Discovery
Objective: Identify multi-omics signatures for disease classification.
Feature selection across omics:
```python
from sklearn.feature_selection import SelectKBest, f_classif


def select_multiomics_features(X_dict, y, n_features=50):
    """
    Select top features across omics for classification.

    X_dict: {
        'rnaseq': DataFrame (samples x genes),
        'proteomics': DataFrame (samples x proteins),
        'methylation': DataFrame (samples x CpGs)
    }
    y: Target labels (disease vs control)

    Returns: Selected features per omics
    """
    selected_features = {}
    for omics_type, X in X_dict.items():
        # Rank features by ANOVA F-score and keep the top k
        selector = SelectKBest(f_classif, k=min(n_features, X.shape[1]))
        selector.fit(X, y)
        # Get selected feature names
        selected_idx = selector.get_support()
        selected_features[omics_type] = X.columns[selected_idx].tolist()
    return selected_features
```
Multi-omics classification:
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def multiomics_classification(X_dict, y, selected_features):
    """
    Train classifier using multi-omics features.
    """
    # Concatenate selected features from each omics
    X_combined = []
    for omics_type, features in selected_features.items():
        X_combined.append(X_dict[omics_type][features])
    X_combined = pd.concat(X_combined, axis=1)
    # Train classifier and estimate AUC with 5-fold cross-validation
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X_combined, y, cv=5, scoring='roc_auc')
    return {
        'mean_auc': scores.mean(),
        'std_auc': scores.std(),
        'n_features': X_combined.shape[1],
        'features_per_omics': {k: len(v) for k, v in selected_features.items()}
    }
```
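When `select_multiomics_features` is run on all samples before `multiomics_classification` cross-validates, the held-out folds have already influenced feature selection, which tends to inflate the reported AUC. One way to avoid this is to put selection inside a scikit-learn `Pipeline` so it is refit on each training fold. The sketch below is a suggested variant, not the original method: `nested_selection_auc` is a hypothetical name, and it selects from the concatenated matrix rather than per omics layer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def nested_selection_auc(X_combined, y, n_features=50, n_splits=5):
    """Cross-validated AUC with feature selection refit inside every fold.

    X_combined: samples x features DataFrame (all omics concatenated column-wise).
    y: binary target labels.
    """
    k = min(n_features, X_combined.shape[1])
    pipeline = Pipeline([
        # Selection is fitted on the training portion of each fold only
        ("select", SelectKBest(f_classif, k=k)),
        ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X_combined, y, cv=cv, scoring="roc_auc")
    return {"mean_auc": float(np.mean(scores)), "std_auc": float(np.std(scores))}
```

The per-omics selection above remains useful for reporting which features matter in each layer; this variant is aimed only at getting a less optimistic performance estimate.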
Phase 8: Integrated Reporting
Generate a comprehensive multi-omics report. The template below uses illustrative values:
```markdown
Multi-Omics Integration Report

Dataset Summary
- Omics Types: RNA-seq, Proteomics, Methylation, CNV
- Common Samples: 45 patients (30 disease, 15 control)
- Features: 15,000 genes, 5,000 proteins, 450K CpGs, 20K CNV regions

Cross-Omics Correlation

RNA-Protein Correlation
- Overall correlation: r = 0.52 (expected: 0.4-0.6)
- Highly correlated: 3,245 genes (45%)
- Discordant genes: 890 genes (post-transcriptional regulation)

Methylation-Expression
- Promoter methylation: Anticorrelation r = -0.41
- Epigenetically regulated genes: 1,256 genes (p < 0.01)
- Example: BRCA1 promoter hypermethylation → 3-fold reduced expression

CNV-Expression Dosage Effect
- Genes with dosage effect: 445 genes (r > 0.5, p < 0.01)
- Example: MYC amplification (3 copies) → 2.8-fold increased expression

Multi-Omics Clustering

MOFA+ Analysis
- Factor 1 (25% variance): Cell cycle genes (RNA + protein)
- Factor 2 (18% variance): Immune signature (RNA + methylation)
- Factor 3 (15% variance): Metabolic reprogramming (RNA + metabolites)

Patient Subtypes
- Subtype 1 (n=18): High proliferation, MYC amplification
- Subtype 2 (n=15): Immune-enriched, hypomethylation
- Subtype 3 (n=12): Metabolic dysregulation, mitochondrial dysfunction

Pathway Integration

Top Dysregulated Pathways (Multi-Omics Score)
- Cell Cycle (score: 8.5) - RNA (↑), Protein (↑), CNV (amplification)
- Immune Response (score: 7.2) - RNA (↑), Methylation (hypo)
- Glycolysis (score: 6.8) - RNA (↑), Metabolites (↑)

Multi-Omics Biomarkers

Classification Performance
- AUC: 0.92 ± 0.04 (5-fold CV)
- Features: 50 total (20 RNA, 15 protein, 10 methylation, 5 CNV)
- Top biomarkers:
  - MYC expression (RNA)
  - CDK1 protein abundance
  - BRCA1 promoter methylation
  - TP53 CNV status

Biological Interpretation
The multi-omics analysis reveals three distinct disease subtypes driven by different molecular mechanisms:
- Proliferative subtype: Characterized by MYC amplification driving coordinated upregulation of cell cycle genes at both RNA and protein levels.
- Immune subtype: Hypomethylation of immune genes leading to increased expression and T-cell infiltration.
- Metabolic subtype: Shift from oxidative phosphorylation to glycolysis, with concordant changes in enzyme expression and metabolite levels.
These subtypes may respond differently to targeted therapies.
```
ToolUniverse Skills Coordination
This skill orchestrates multiple specialized skills:
| Skill | Used For | Phase |
|---|---|---|
| RNA-seq | Load and analyze RNA-seq data | Phase 1, 4 |
| epigenomics | Methylation analysis, ChIP-seq peaks | Phase 1, 4 |
| variant-analysis | CNV and SNV processing | Phase 1, 3, 4 |
| protein-interactions | Protein network context | Phase 6 |
| gene-enrichment | Pathway enrichment | Phase 6 |
| | Public omics data retrieval | Phase 1 |
| | Gene/protein annotation | Phase 3, 8 |
Example Use Cases
Use Case 1: Cancer Multi-Omics
Question: "Integrate TCGA breast cancer RNA-seq, proteomics, methylation, and CNV data"
Workflow:
- Load 4 omics types for 500 patients
- Match samples (450 common across all omics)
- Correlate RNA-protein (identify translation-regulated genes)
- Correlate methylation-expression (find epigenetically silenced genes)
- Correlate CNV-expression (identify dosage-sensitive genes)
- Run MOFA+ to find latent factors
- Identify 4 subtypes with distinct multi-omics profiles
- Perform pathway enrichment per subtype
- Select multi-omics biomarkers (AUC=0.94)
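The sample-matching and RNA-protein correlation steps of this workflow can be sketched as follows. This is a minimal illustration, not the skill's implementation: the helper names (`match_samples`, `align`, `rowwise_pearson`), the features x samples matrix layout, and the toy values are all assumptions made for the example, and Pearson is used here although Spearman is equally valid.

```python
import numpy as np

def match_samples(*sample_lists):
    """Return the sample IDs present in every omics layer, sorted."""
    common = set(sample_lists[0])
    for ids in sample_lists[1:]:
        common &= set(ids)
    return sorted(common)

def align(matrix, sample_ids, common):
    """Reorder the columns of a features x samples matrix to follow `common`."""
    index = {s: i for i, s in enumerate(sample_ids)}
    return matrix[:, [index[s] for s in common]]

def rowwise_pearson(a, b):
    """Pearson r between matching rows (genes) of two aligned matrices."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return (a * b).sum(axis=1) / denom

# Toy data (hypothetical): 2 genes, samples stored in a different order per layer.
rna_samples = ["P1", "P2", "P3", "P4", "P5"]
prot_samples = ["P5", "P3", "P2", "P1", "P9"]
rna = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],    # GENE_A
                [5.0, 1.0, 4.0, 2.0, 3.0]])   # GENE_B
prot = np.array([[10.0, 6.0, 4.0, 2.0, 0.0],  # GENE_A
                 [3.0, 2.0, 5.0, 1.0, 7.0]])  # GENE_B

common = match_samples(rna_samples, prot_samples)
r = rowwise_pearson(align(rna, rna_samples, common),
                    align(prot, prot_samples, common))
```

In this toy input GENE_A is concordant between RNA and protein while GENE_B is anti-correlated; genes with weak or negative RNA-protein correlation are the candidates for post-transcriptional regulation mentioned in the workflow.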
Use Case 2: eQTL + Expression
Question: "How do GWAS variants affect gene expression through methylation?"
Workflow:
- Load genotype data (SNPs from GWAS)
- Load expression data (RNA-seq)
- Load methylation data (450K array)
- For each GWAS SNP:
  - Test association with nearby gene expression (eQTL)
  - Test association with nearby CpG methylation (meQTL)
  - Test CpG-gene correlation
- Identify SNP → methylation → expression regulatory chains
- Annotate with ToolUniverse (GWAS traits, gene function)
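For a single SNP, CpG, and gene, the three tests above can be sketched with ordinary least squares. This is a simplified illustration under stated assumptions: the function names `slope` and `chain_summary` are hypothetical, the data are simulated, and a real analysis would add covariates (ancestry, batch, cell composition), p-values, and multiple-testing correction. The extra "adjusted" slope is a rough mediation-style check added for the example, not a step named in the workflow.

```python
import numpy as np

def slope(x, y, covariate=None):
    """OLS coefficient of x when regressing y on x (plus an optional covariate)."""
    cols = [np.ones(len(x)), np.asarray(x, dtype=float)]
    if covariate is not None:
        cols.append(np.asarray(covariate, dtype=float))
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef[1]

def chain_summary(genotype, cpg_beta, expression):
    """Effect sizes for one SNP -> CpG -> gene candidate chain."""
    return {
        "eqtl_slope": slope(genotype, expression),       # SNP vs expression
        "meqtl_slope": slope(genotype, cpg_beta),        # SNP vs methylation
        "cpg_expr_r": np.corrcoef(cpg_beta, expression)[0, 1],
        # SNP effect on expression after adjusting for methylation; a value
        # near zero is consistent with the effect passing through the CpG.
        "eqtl_slope_adj_cpg": slope(genotype, expression, covariate=cpg_beta),
    }

# Simulated data (hypothetical): the SNP raises methylation, which lowers expression.
rng = np.random.default_rng(0)
genotype = rng.integers(0, 3, size=200)  # minor-allele dosage 0/1/2
cpg_beta = 0.2 + 0.15 * genotype + rng.normal(0, 0.02, size=200)
expression = 10 - 8 * cpg_beta + rng.normal(0, 0.1, size=200)

result = chain_summary(genotype, cpg_beta, expression)
```

A chain is only suggested, not proven, when the eQTL and meQTL effects are both present and the adjusted eQTL slope shrinks toward zero; correlation alone cannot establish the direction of the methylation-expression relationship.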
Use Case 3: Drug Response Multi-Omics
Question: "Predict drug response using multi-omics profiles"
Workflow:
- Load baseline multi-omics (pre-treatment)
- Load drug response data (IC50 or clinical response)
- Correlate each omics with response
- Select multi-omics features predictive of response
- Train multi-omics classifier
- Identify pathways associated with resistance/sensitivity
- Use ToolUniverse drug-repurposing skill for alternative options
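The "correlate each omics with response" and feature-selection steps might look like the sketch below. It ranks features within each layer by absolute Pearson correlation with a continuous response such as IC50. The function name `top_features`, the samples x features layout, the feature labels, and the simulated data are assumptions for illustration only.

```python
import numpy as np

def top_features(layers, response, k=1):
    """Pick the k features per layer most correlated (by |r|) with the response.

    layers: dict mapping layer name -> (feature_names, samples x features array)
    """
    resp = response - response.mean()
    selected = []
    for layer, (names, X) in layers.items():
        Xc = X - X.mean(axis=0)
        r = (Xc * resp[:, None]).sum(axis=0) / np.sqrt(
            (Xc ** 2).sum(axis=0) * (resp ** 2).sum())
        for i in np.argsort(-np.abs(r))[:k]:
            selected.append((layer, names[i], float(r[i])))
    return selected

# Simulated data (hypothetical): one informative feature planted in each layer.
rng = np.random.default_rng(1)
n = 60
ic50 = rng.normal(size=n)
rna = rng.normal(size=(n, 5))
rna[:, 0] = ic50 + 0.3 * rng.normal(size=n)       # tracks resistance
protein = rng.normal(size=(n, 4))
protein[:, 2] = -ic50 + 0.3 * rng.normal(size=n)  # tracks sensitivity

layers = {
    "rna": ([f"GENE_{i}" for i in range(5)], rna),
    "protein": ([f"PROT_{i}" for i in range(4)], protein),
}
selected = top_features(layers, ic50, k=1)
```

When the selected features feed a classifier, the selection should be repeated inside each cross-validation fold; selecting on the full dataset first leaks information and inflates the reported performance.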
Advanced Analysis Patterns
Pattern 1: Omics-Driven Patient Stratification
For precision medicine applications where patient stratification is the goal.
Pattern 2: Multi-Omics Network Analysis
Build integrated networks that combine PPI, co-expression, and regulatory interactions.
Pattern 3: Temporal Multi-Omics
For longitudinal multi-omics data (time series or treatment response).
Pattern 4: Spatial Multi-Omics
Combine spatial transcriptomics with proteomics to study tissue architecture.
Quantified Minimums
| Component | Requirement |
|---|---|
| Omics types | At least 2 omics datasets |
| Common samples | At least 10 samples shared across all omics layers |
| Cross-correlation | Pearson/Spearman correlation computed |
| Clustering | At least one method (MOFA+, NMF, or SNF) |
| Pathway integration | Enrichment with multi-omics evidence scores |
| Report | Summary, correlations, clusters, pathways, biomarkers |
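The first two rows of this table can be checked before any analysis starts. The sketch below is one possible pre-flight check; the function name `check_minimums` and the input layout (a dict of sample ID lists per omics layer) are assumptions for illustration.

```python
def check_minimums(samples_by_omics, min_omics=2, min_common=10):
    """Return (common sample IDs, list of unmet requirements)."""
    problems = []
    if len(samples_by_omics) < min_omics:
        problems.append(
            f"need at least {min_omics} omics datasets, got {len(samples_by_omics)}")
    id_sets = [set(ids) for ids in samples_by_omics.values()]
    common = set.intersection(*id_sets) if id_sets else set()
    if len(common) < min_common:
        problems.append(
            f"need at least {min_common} common samples, got {len(common)}")
    return sorted(common), problems

# Hypothetical cohort: two layers with partially overlapping samples.
cohort = {
    "rna": [f"S{i}" for i in range(1, 16)],      # S1..S15
    "protein": [f"S{i}" for i in range(4, 20)],  # S4..S19
}
common, problems = check_minimums(cohort)
```

Returning the list of unmet requirements, rather than raising on the first one, lets a report state every shortfall at once.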
Limitations
- Sample size: Multi-omics integration requires sufficient samples (n≥20 recommended)
- Missing data: Some patients may not have all omics types
- Batch effects: Different omics platforms/batches require careful normalization
- Computational: Large multi-omics datasets may require significant memory/compute
- Interpretation: Multi-omics results require domain expertise for biological validation
References
Methods:
- MOFA+: https://doi.org/10.1186/s13059-020-02015-1
- Similarity Network Fusion: https://doi.org/10.1038/nmeth.2810
- Multi-omics review: https://doi.org/10.1038/s41576-019-0093-7
ToolUniverse Skills:
- See individual skill documentation for omics-specific methods