Integrate and analyze multiple omics datasets (transcriptomics, proteomics, epigenomics, genomics, metabolomics) for systems biology and precision medicine. Performs cross-omics correlation, multi-omics clustering (MOFA+, NMF), pathway-level integration, and sample matching. Coordinates ToolUniverse skills for expression data (RNA-seq), epigenomics (methylation, ChIP-seq), variants (SNVs, CNVs), protein interactions, and pathway enrichment. Use when analyzing multi-omics datasets, performing integrative analysis, discovering multi-omics biomarkers, studying disease mechanisms across molecular layers, or conducting systems biology research that requires coordinated analysis of transcriptome, genome, epigenome, proteome, and metabolome data.
npx skill4agent add mims-harvard/tooluniverse tooluniverse-multi-omics-integration

| Capability | Description |
|---|---|
| Data Integration | Match samples across omics, handle missing data, normalize scales |
| Cross-Omics Correlation | Correlate features across molecular layers (gene expression vs protein, methylation vs expression) |
| Multi-Omics Clustering | MOFA+, NMF, joint clustering to identify omics-driven subtypes |
| Pathway Integration | Combine omics evidence at pathway level for unified biological interpretation |
| Biomarker Discovery | Identify multi-omics signatures with improved predictive power |
| Skill Coordination | Orchestrate RNA-seq, epigenomics, variant-analysis, protein-interactions, gene-enrichment skills |
| Visualization | Circos plots, integrated heatmaps, network visualizations |
| Reporting | Unified multi-omics reports with cross-layer insights |
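Matching samples across omics, as listed in the table above, only works when identifiers agree exactly, which raw exports rarely do. The following is a minimal sketch of identifier normalization. It assumes, hypothetically, that IDs differ only in case, surrounding whitespace, separator characters, and a trailing assay suffix such as `_RNA` or `_PROT`. The function names and the suffix list are illustrative and are not part of ToolUniverse.

```python
import re

# Hypothetical assay suffixes; adjust to the naming scheme of your own data.
ASSAY_SUFFIXES = ("RNA", "PROT", "METH", "CNV")


def normalize_sample_id(raw_id, suffixes=ASSAY_SUFFIXES):
    """Return a canonical form of a sample ID for cross-omics matching."""
    # Trim whitespace, upper-case, and unify separators to '-'
    sample_id = re.sub(r"[\s_.]+", "-", raw_id.strip().upper())
    # Drop one trailing assay suffix, e.g. 'P001-RNA' -> 'P001'
    for suffix in suffixes:
        if sample_id.endswith("-" + suffix):
            return sample_id[: -(len(suffix) + 1)]
    return sample_id


def common_samples(id_lists):
    """Intersect normalized sample IDs across a dict of omics -> raw ID list."""
    normalized = [
        {normalize_sample_id(s) for s in ids} for ids in id_lists.values()
    ]
    return sorted(set.intersection(*normalized)) if normalized else []
```

With pandas inputs, the same function can be applied to the columns before matching, for example `df.rename(columns=normalize_sample_id)`.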
Input: Multiple Omics Datasets
|
v
Phase 1: Data Loading & QC
|-- Load RNA-seq (expression matrix)
|-- Load proteomics (protein abundance)
|-- Load methylation (beta values or M-values)
|-- Load variants (CNV, SNV from VCF)
|-- Load metabolomics (metabolite abundance)
|-- Quality control per omics type
|
v
Phase 2: Sample Matching
|-- Match samples across omics by ID
|-- Identify common samples
|-- Handle batch effects
|-- Normalize sample identifiers
|
v
Phase 3: Feature Mapping
|-- Map features to common identifier space (genes, proteins, metabolites)
|-- Link CpG sites to genes (promoter, gene body)
|-- Map variants to genes
|-- Create unified feature matrix
|
v
Phase 4: Cross-Omics Correlation
|-- Gene expression vs protein abundance (translation efficiency)
|-- Promoter methylation vs expression (epigenetic regulation)
|-- CNV vs expression (dosage effect)
|-- eQTL variants vs expression (genetic regulation)
|-- Metabolite vs enzyme expression (metabolic flux)
|
v
Phase 5: Multi-Omics Clustering
|-- MOFA+ (Multi-Omics Factor Analysis) for latent factors
|-- NMF (Non-negative Matrix Factorization) for patient subtypes
|-- Joint clustering across omics
|-- Identify omics-specific vs shared variation
|
v
Phase 6: Pathway-Level Integration
|-- Aggregate omics to pathway level
|-- Score pathway dysregulation (combined evidence)
|-- Use ToolUniverse enrichment tools (Reactome, KEGG, GO)
|-- Identify driver pathways across omics
|
v
Phase 7: Biomarker Discovery
|-- Feature selection across omics
|-- Multi-omics signatures for classification
|-- Cross-validation and performance
|-- Interpretation and biological validation
|
v
Phase 8: Generate Integrated Report
|-- Summary statistics per omics
|-- Cross-omics correlation results
|-- Multi-omics clusters and subtypes
|-- Top dysregulated pathways
|-- Multi-omics biomarkers
|-- Biological interpretation

# RNA-seq QC
- Filter low-count genes (mean counts < threshold)
- Normalize (TPM, FPKM, or DESeq2)
- Log-transform for correlation
# Proteomics QC
- Filter proteins with high missing values
- Impute missing values (minimum, KNN)
- Normalize (median, quantile)
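The three proteomics steps above can be sketched with pandas. This is a minimal illustration, assuming a proteins x samples DataFrame of log-scale intensities with NaN marking missing values. The function name, the 50% missingness cutoff, and the per-protein minimum imputation are illustrative defaults, not settings prescribed by this skill.

```python
import numpy as np
import pandas as pd


def proteomics_qc(abundance, max_missing_frac=0.5):
    """Filter, impute, and median-normalize a proteins x samples matrix.

    abundance: DataFrame of log-scale intensities, NaN = missing (assumed).
    """
    # 1. Drop proteins missing in more than max_missing_frac of samples
    missing_frac = abundance.isna().mean(axis=1)
    kept = abundance.loc[missing_frac <= max_missing_frac]

    # 2. Impute remaining gaps with each protein's minimum observed value
    #    (a simple stand-in for left-censored, low-abundance missingness)
    row_min = kept.min(axis=1)
    imputed = kept.apply(lambda col: col.fillna(row_min), axis=0)

    # 3. Median normalization: shift each sample so its median is zero
    return imputed - imputed.median(axis=0)
```

KNN imputation or quantile normalization, the alternatives named in the list, would replace steps 2 and 3 respectively.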
# Methylation QC
- Remove failed probes
- Correct for batch effects (ComBat)
- Filter cross-reactive probes
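The loading phase accepts methylation as either beta values or M-values. The two are related by the logit transform M = log2(beta / (1 - beta)), and M-values are commonly preferred for statistical testing and batch correction. A minimal sketch of the conversion follows. The function names and the clipping constant `eps` are assumptions for illustration.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value."""
    # Clip away from 0 and 1 so the logit stays finite (eps is an assumed default)
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: M-value back to a beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

Beta values remain the more interpretable scale for reporting, so converting back with `m_to_beta` after correction is a reasonable pattern.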
# Variants QC
- Use variant-analysis skill for VCF QC
- CNV segmentation validation

def match_samples_across_omics(omics_data_dict):
    """
    Match samples across multiple omics datasets.

    Parameters:
        omics_data_dict: {
            'rnaseq': DataFrame (genes x samples),
            'proteomics': DataFrame (proteins x samples),
            'methylation': DataFrame (CpGs x samples),
            'cnv': DataFrame (genes x samples)
        }
    Returns:
        - common_samples: List of sample IDs present in all omics
        - matched_data: Dict of DataFrames with common samples only
    """
    # Extract sample IDs from each omics
    sample_ids = {
        omics_type: set(df.columns)
        for omics_type, df in omics_data_dict.items()
    }
    # Find common samples (intersection); an empty input has no common samples
    common_samples = set.intersection(*sample_ids.values()) if sample_ids else set()
    # Subset each omics to common samples, in a stable sorted order
    matched_data = {
        omics_type: df[sorted(common_samples)]
        for omics_type, df in omics_data_dict.items()
    }
    return sorted(common_samples), matched_data

# Map all features to genes
feature_mapping = {
    'rnaseq': 'gene_symbol',        # Already gene-level
    'proteomics': 'gene_symbol',    # Map protein to gene
    'methylation': 'gene_symbol',   # Map CpG to gene (promoter)
    'cnv': 'gene_symbol',           # CNV regions to overlapping genes
    'metabolomics': 'enzyme_gene'   # Metabolite to enzyme gene
}

def correlate_rna_protein(rnaseq_data, proteomics_data):
    """
    Correlate mRNA and protein levels for each gene.
    Expected: Positive correlation (r ~ 0.4-0.6 typical)
    Discordance indicates post-transcriptional regulation
    """
    from scipy.stats import spearmanr
    # Find common genes (both inputs must share the same sample columns, in the same order)
    common_genes = set(rnaseq_data.index) & set(proteomics_data.index)
    correlations = {}
    for gene in common_genes:
        rna = rnaseq_data.loc[gene]
        protein = proteomics_data.loc[gene]
        # Spearman correlation (robust to outliers)
        r, p = spearmanr(rna, protein)
        correlations[gene] = {'r': r, 'p': p}
    # Identify discordant genes (low RNA-protein correlation)
    discordant = {g: v for g, v in correlations.items() if abs(v['r']) < 0.2}
    return correlations, discordant

def correlate_methylation_expression(methylation_data, rnaseq_data):
    """
    Correlate promoter methylation with gene expression.
    Expected: Negative correlation (increased methylation → decreased expression)
    """
    from scipy.stats import spearmanr
    # For each gene with promoter methylation
    results = {}
    for gene in methylation_data.index:
        if gene in rnaseq_data.index:
            meth = methylation_data.loc[gene]  # Average promoter beta
            expr = rnaseq_data.loc[gene]
            r, p = spearmanr(meth, expr)
            results[gene] = {'r': r, 'p': p, 'direction': 'repressive' if r < 0 else 'activating'}
    # Identify genes with strong methylation-expression anticorrelation
    regulated = {g: v for g, v in results.items() if v['r'] < -0.5 and v['p'] < 0.01}
    return results, regulated

def correlate_cnv_expression(cnv_data, rnaseq_data):
    """
    Correlate copy number with gene expression.
    Expected: Positive correlation (gene dosage effect)
    """
    from scipy.stats import pearsonr
    results = {}
    for gene in cnv_data.index:
        if gene in rnaseq_data.index:
            cnv = cnv_data.loc[gene]  # log2 ratio
            expr = rnaseq_data.loc[gene]
            r, p = pearsonr(cnv, expr)
            results[gene] = {'r': r, 'p': p}
    # Genes with dosage effect (CNV drives expression)
    dosage_genes = {g: v for g, v in results.items() if v['r'] > 0.5 and v['p'] < 0.01}
    return results, dosage_genes

# Conceptual workflow (uses R's MOFA2 package or Python implementation)
# 1. Prepare multi-omics data as list of matrices
# 2. Run MOFA+ to identify factors
# 3. Inspect factor variance explained per omics
# 4. Cluster samples based on factor scores
# Example interpretation:
# Factor 1: Explains 40% variance in RNA-seq, 30% in proteomics → Cell proliferation
# Factor 2: Explains 50% variance in methylation → Epigenetic subtype
# Factor 3: Explains 20% variance in CNV → Genomic instability

def joint_nmf_clustering(omics_data_dict, n_clusters=3):
    """
    Perform joint NMF across omics for clustering.
    Returns patient cluster assignments based on shared factors.
    """
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import NMF
    # Concatenate omics matrices (after normalization). Each matrix is
    # features x samples with matched sample columns, and every value must
    # be non-negative for NMF.
    combined_matrix = np.vstack([
        omics_data_dict['rnaseq'].values,
        omics_data_dict['proteomics'].values,
        omics_data_dict['methylation'].values
    ])
    # Run NMF
    model = NMF(n_components=n_clusters, init='nndsvd', random_state=42)
    W = model.fit_transform(combined_matrix)  # Feature loadings (features x components)
    H = model.components_                     # Sample coefficients (components x samples)
    # Cluster samples based on H (components)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit_predict(H.T)
    return clusters, W, H

def integrate_pathway_evidence(omics_results, pathway_genes):
    """
    Score pathway dysregulation across omics.

    omics_results: {
        'rnaseq': {'gene': fold_change},
        'proteomics': {'gene': fold_change},
        'methylation': {'gene': methylation_diff},
        'cnv': {'gene': copy_number}
    }
    pathway_genes: List of genes in pathway
    """
    import numpy as np
    pathway_scores = []
    omics_with_evidence = set()  # Omics layers that contributed to any pathway gene
    # For each gene in pathway
    for gene in pathway_genes:
        gene_score = 0
        evidence_count = 0
        # RNA-seq evidence
        if gene in omics_results['rnaseq']:
            gene_score += abs(omics_results['rnaseq'][gene])
            evidence_count += 1
            omics_with_evidence.add('rnaseq')
        # Proteomics evidence
        if gene in omics_results['proteomics']:
            gene_score += abs(omics_results['proteomics'][gene])
            evidence_count += 1
            omics_with_evidence.add('proteomics')
        # Methylation evidence (negative correlation)
        if gene in omics_results['methylation']:
            gene_score += abs(omics_results['methylation'][gene])
            evidence_count += 1
            omics_with_evidence.add('methylation')
        # CNV evidence
        if gene in omics_results['cnv']:
            gene_score += abs(omics_results['cnv'][gene])
            evidence_count += 1
            omics_with_evidence.add('cnv')
        if evidence_count > 0:
            pathway_scores.append(gene_score / evidence_count)
    # Aggregate pathway score (mean of gene scores)
    pathway_score = np.mean(pathway_scores) if pathway_scores else 0
    return {
        'pathway_score': pathway_score,
        'n_genes_with_evidence': len(pathway_scores),
        # Count layers across the whole pathway, not just the last gene visited
        'n_omics_types': len(omics_with_evidence)
    }

# Get pathways for gene set
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()  # Tools must be loaded before they can be run
# Enrichment for genes dysregulated in ANY omics
# (rnaseq_degs, diff_proteins, methylation_dmgs come from the per-omics analyses)
all_dysregulated_genes = set()
all_dysregulated_genes.update(rnaseq_degs)
all_dysregulated_genes.update(diff_proteins)
all_dysregulated_genes.update(methylation_dmgs)
# Run enrichment (sorted so the submitted gene list is reproducible)
enrichment = tu.run_one_function({
    "name": "enrichr_enrich",
    "arguments": {
        "gene_list": ",".join(sorted(all_dysregulated_genes)),
        "library": "KEGG_2021_Human"
    }
})
# Score each pathway with multi-omics evidence
for pathway in enrichment['data']['results']:
    pathway_genes = pathway['genes']
    pathway['multi_omics_score'] = integrate_pathway_evidence(
        omics_results, pathway_genes
    )

def select_multiomics_features(X_dict, y, n_features=50):
    """
    Select top features across omics for classification.

    X_dict: {
        'rnaseq': DataFrame (samples x genes),
        'proteomics': DataFrame (samples x proteins),
        'methylation': DataFrame (samples x CpGs)
    }
    y: Target labels (disease vs control)
    Returns: Selected features per omics
    """
    from sklearn.feature_selection import SelectKBest, f_classif
    selected_features = {}
    for omics_type, X in X_dict.items():
        selector = SelectKBest(f_classif, k=min(n_features, X.shape[1]))
        selector.fit(X, y)
        # Get selected feature names
        selected_idx = selector.get_support()
        selected_features[omics_type] = X.columns[selected_idx].tolist()
    return selected_features

def multiomics_classification(X_dict, y, selected_features):
    """
    Train classifier using multi-omics features.
    """
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    # Concatenate selected features from each omics
    X_combined = []
    for omics_type, features in selected_features.items():
        X_combined.append(X_dict[omics_type][features])
    X_combined = pd.concat(X_combined, axis=1)
    # Train classifier
    # Note: if the features were selected on the full dataset, the cross-validated
    # AUC is optimistic; select features inside each fold for an unbiased estimate.
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X_combined, y, cv=5, scoring='roc_auc')
    return {
        'mean_auc': scores.mean(),
        'std_auc': scores.std(),
        'n_features': X_combined.shape[1],
        'features_per_omics': {k: len(v) for k, v in selected_features.items()}
    }

# Multi-Omics Integration Report
## Dataset Summary
- **Omics Types**: RNA-seq, Proteomics, Methylation, CNV
- **Common Samples**: 45 patients (30 disease, 15 control)
- **Features**: 15,000 genes, 5,000 proteins, 450K CpGs, 20K CNV regions
## Cross-Omics Correlation
### RNA-Protein Correlation
- **Overall correlation**: r = 0.52 (expected: 0.4-0.6)
- **Highly correlated**: 3,245 genes (45%)
- **Discordant genes**: 890 genes (post-transcriptional regulation)
### Methylation-Expression
- **Promoter methylation**: Anticorrelation r = -0.41
- **Epigenetically regulated genes**: 1,256 genes (p < 0.01)
- **Example**: BRCA1 promoter hypermethylation → 3-fold reduced expression
### CNV-Expression Dosage Effect
- **Genes with dosage effect**: 445 genes (r > 0.5, p < 0.01)
- **Example**: MYC amplification (3 copies) → 2.8-fold increased expression
## Multi-Omics Clustering
### MOFA+ Analysis
- **Factor 1** (25% variance): Cell cycle genes (RNA + protein)
- **Factor 2** (18% variance): Immune signature (RNA + methylation)
- **Factor 3** (15% variance): Metabolic reprogramming (RNA + metabolites)
### Patient Subtypes
- **Subtype 1** (n=18): High proliferation, MYC amplification
- **Subtype 2** (n=15): Immune-enriched, hypomethylation
- **Subtype 3** (n=12): Metabolic dysregulation, mitochondrial dysfunction
## Pathway Integration
### Top Dysregulated Pathways (Multi-Omics Score)
1. **Cell Cycle** (score: 8.5) - RNA (↑), Protein (↑), CNV (amplification)
2. **Immune Response** (score: 7.2) - RNA (↑), Methylation (hypo)
3. **Glycolysis** (score: 6.8) - RNA (↑), Metabolites (↑)
## Multi-Omics Biomarkers
### Classification Performance
- **AUC**: 0.92 ± 0.04 (5-fold CV)
- **Features**: 50 total (20 RNA, 15 protein, 10 methylation, 5 CNV)
- **Top biomarkers**:
  - MYC expression (RNA)
  - CDK1 protein abundance
  - BRCA1 promoter methylation
  - TP53 CNV status
## Biological Interpretation
The multi-omics analysis reveals three distinct disease subtypes driven by different molecular mechanisms:
1. **Proliferative subtype**: Characterized by MYC amplification driving coordinated upregulation of cell cycle genes at both RNA and protein levels.
2. **Immune subtype**: Hypomethylation of immune genes leading to increased expression and T-cell infiltration.
3. **Metabolic subtype**: Shift from oxidative phosphorylation to glycolysis, with concordant changes in enzyme expression and metabolite levels.
These subtypes may respond differently to targeted therapies.

| Skill | Used For | Phase |
|---|---|---|
|  | Load and analyze RNA-seq data | Phase 1, 4 |
|  | Methylation analysis, ChIP-seq peaks | Phase 1, 4 |
|  | CNV and SNV processing | Phase 1, 3, 4 |
|  | Protein network context | Phase 6 |
|  | Pathway enrichment | Phase 6 |
|  | Public omics data retrieval | Phase 1 |
|  | Gene/protein annotation | Phase 3, 8 |

| Component | Requirement |
|---|---|
| Omics types | At least 2 omics datasets |
| Common samples | At least 10 samples across omics |
| Cross-correlation | Pearson/Spearman correlation computed |
| Clustering | At least one method (MOFA+, NMF, or SNF) |
| Pathway integration | Enrichment with multi-omics evidence scores |
| Report | Summary, correlations, clusters, pathways, biomarkers |
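
The first two requirements in the checklist above are mechanical and can be verified before any analysis runs. A minimal sketch follows, assuming each dataset is represented only by its collection of sample IDs. The function name and the layout of the returned dictionary are illustrative.

```python
def check_integration_inputs(sample_ids_by_omics, min_omics=2, min_common=10):
    """Check the dataset-count and common-sample requirements.

    sample_ids_by_omics: dict mapping omics name -> iterable of sample IDs.
    Returns a dict with the common samples and a list of failed checks.
    """
    id_sets = {name: set(ids) for name, ids in sample_ids_by_omics.items()}
    common = set.intersection(*id_sets.values()) if id_sets else set()

    problems = []
    if len(id_sets) < min_omics:
        problems.append(
            f"need at least {min_omics} omics datasets, got {len(id_sets)}"
        )
    if len(common) < min_common:
        problems.append(
            f"need at least {min_common} common samples, got {len(common)}"
        )

    return {
        "common_samples": sorted(common),
        "problems": problems,
        "ok": not problems,
    }
```

For DataFrames laid out as features x samples, the input can be built as `{name: df.columns for name, df in omics_data_dict.items()}`.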