tooluniverse-metabolomics-analysis
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseMetabolomics Analysis
代谢组学数据分析
Comprehensive analysis of metabolomics data from metabolite identification through quantification, statistical analysis, pathway interpretation, and integration with other omics layers.
对代谢组学数据进行全面分析,涵盖从代谢物鉴定、定量、统计分析、通路解析到与其他组学层面整合的全流程。
When to Use This Skill
何时使用本技能
Triggers:
- User has metabolomics data (LC-MS, GC-MS, NMR)
- Questions about metabolite abundance or concentrations
- Differential metabolite analysis requests
- Metabolic pathway analysis
- Multi-omics integration with metabolomics
- Metabolic biomarker discovery
- Flux balance analysis or metabolic modeling
- Metabolite-enzyme correlation
Example Questions This Skill Solves:
- "Analyze this LC-MS metabolomics data for differential metabolites"
- "Which metabolic pathways are dysregulated between conditions?"
- "Identify metabolite biomarkers for disease classification"
- "Correlate metabolite levels with enzyme expression"
- "Perform pathway enrichment for differential metabolites"
- "Integrate metabolomics with transcriptomics data"
- "Characterize the metabolic phenotype of this cell line"
- "Identify metabolites associated with drug response"
触发场景:
- 用户拥有代谢组学数据(LC-MS、GC-MS、NMR)
- 关于代谢物丰度或浓度的问题
- 差异代谢物分析请求
- 代谢通路分析
- 代谢组学与多组学整合
- 代谢生物标志物发现
- 通量平衡分析或代谢建模
- 代谢物-酶相关性分析
本技能可解决的示例问题:
- "分析这份LC-MS代谢组学数据,找出差异代谢物"
- "不同条件下哪些代谢通路出现失调?"
- "鉴定用于疾病分类的代谢生物标志物"
- "将代谢物水平与酶表达进行关联分析"
- "对差异代谢物进行通路富集分析"
- "整合代谢组学与转录组学数据"
- "表征该细胞系的代谢表型"
- "鉴定与药物响应相关的代谢物"
Core Capabilities
核心能力
| Capability | Description |
|---|---|
| Data Import | LC-MS, GC-MS, NMR, targeted/untargeted platforms |
| Metabolite Identification | Match to HMDB, KEGG, PubChem, spectral libraries |
| Quality Control | Peak quality, blank subtraction, internal standard normalization |
| Normalization | Probabilistic quotient, total ion current, internal standards |
| Statistical Analysis | Univariate and multivariate (PCA, PLS-DA, OPLS-DA) |
| Differential Analysis | Identify significant metabolite changes |
| Pathway Enrichment | KEGG, Reactome, BioCyc metabolic pathway analysis |
| Metabolite-Enzyme Integration | Correlate with expression data |
| Flux Analysis | Metabolic flux balance analysis (FBA) |
| Biomarker Discovery | Multi-metabolite signatures |
| 能力 | 描述 |
|---|---|
| 数据导入 | 支持LC-MS、GC-MS、NMR及靶向/非靶向平台数据 |
| 代谢物鉴定 | 匹配HMDB、KEGG、PubChem及谱图库 |
| 质量控制 | 峰质量评估、空白扣除、内标归一化 |
| 归一化 | 概率商归一化、总离子流归一化、内标归一化 |
| 统计分析 | 单变量及多变量分析(PCA、PLS-DA、OPLS-DA) |
| 差异分析 | 鉴定丰度显著变化的代谢物 |
| 通路富集 | KEGG、Reactome、BioCyc代谢通路分析 |
| 代谢物-酶整合 | 与表达数据进行关联分析 |
| 通量分析 | 代谢通量平衡分析(FBA) |
| 生物标志物发现 | 多代谢物特征分析 |
Workflow Overview
工作流程概览
Input: Metabolomics Data (Peak Table or Spectra)
|
v
Phase 1: Data Import & Metabolite Identification
|-- Load peak table or process raw spectra
|-- Match features to metabolite databases (HMDB, KEGG)
|-- Annotate with chemical IDs, formulas, pathways
|-- Confidence scoring for IDs
|
v
Phase 2: Quality Control & Filtering
|-- Assess peak quality (CV, blank ratios)
|-- Remove background peaks
|-- Filter low-quality metabolites
|-- Internal standard check
|
v
Phase 3: Normalization
|-- Sample-wise normalization (TIC, PQN, internal standards)
|-- Batch effect correction
|-- Log-transform or scaling
|
v
Phase 4: Exploratory Analysis
|-- PCA for sample clustering
|-- Quality assessment plots
|-- Outlier detection
|-- Sample correlation
|
v
Phase 5: Differential Analysis
|-- Statistical testing (t-test, ANOVA, Wilcoxon)
|-- Fold change calculation
|-- Multiple testing correction
|-- Volcano plots, heatmaps
|
v
Phase 6: Pathway Analysis
|-- Metabolite set enrichment (MSEA)
|-- Pathway topology analysis
|-- KEGG/Reactome pathway mapping
|-- Identify dysregulated pathways
|
v
Phase 7: Multi-Omics Integration
|-- Correlate with enzyme expression (RNA/protein)
|-- Metabolite-gene associations
|-- Pathway-level integration
|-- Metabolic flux inference
|
v
Phase 8: Generate Report
|-- Summary statistics
|-- Differential metabolites
|-- Pathway diagrams
|-- Multi-omics integration plots
|-- Biomarker panelInput: Metabolomics Data (Peak Table or Spectra)
|
v
Phase 1: Data Import & Metabolite Identification
|-- Load peak table or process raw spectra
|-- Match features to metabolite databases (HMDB, KEGG)
|-- Annotate with chemical IDs, formulas, pathways
|-- Confidence scoring for IDs
|
v
Phase 2: Quality Control & Filtering
|-- Assess peak quality (CV, blank ratios)
|-- Remove background peaks
|-- Filter low-quality metabolites
|-- Internal standard check
|
v
Phase 3: Normalization
|-- Sample-wise normalization (TIC, PQN, internal standards)
|-- Batch effect correction
|-- Log-transform or scaling
|
v
Phase 4: Exploratory Analysis
|-- PCA for sample clustering
|-- Quality assessment plots
|-- Outlier detection
|-- Sample correlation
|
v
Phase 5: Differential Analysis
|-- Statistical testing (t-test, ANOVA, Wilcoxon)
|-- Fold change calculation
|-- Multiple testing correction
|-- Volcano plots, heatmaps
|
v
Phase 6: Pathway Analysis
|-- Metabolite set enrichment (MSEA)
|-- Pathway topology analysis
|-- KEGG/Reactome pathway mapping
|-- Identify dysregulated pathways
|
v
Phase 7: Multi-Omics Integration
|-- Correlate with enzyme expression (RNA/protein)
|-- Metabolite-gene associations
|-- Pathway-level integration
|-- Metabolic flux inference
|
v
Phase 8: Generate Report
|-- Summary statistics
|-- Differential metabolites
|-- Pathway diagrams
|-- Multi-omics integration plots
|-- Biomarker panelPhase Details
各阶段详情
Phase 1: Data Import & Metabolite Identification
阶段1:数据导入与代谢物鉴定
Objective: Load data and identify metabolites from features.
Supported data types:
- Peak tables: Pre-processed metabolite abundance
- Raw spectra: LC-MS (.mzML, .mzXML), GC-MS
- NMR spectra: 1D/2D NMR data
Peak table format (typical):
Sample_ID | Glucose | Lactate | Glutamine | ... | Cholesterol
----------|---------|---------|-----------|-----|------------
Control_1 | 125000 | 45000 | 78000 | ... | 23000
Control_2 | 130000 | 43000 | 82000 | ... | 25000
Disease_1 | 85000 | 92000 | 45000 | ... | 45000Data loading:
python
def load_metabolomics_data(file_path, file_type='peak_table'):
"""
Load metabolomics data.
file_type options:
- 'peak_table': CSV/TSV with metabolites as columns
- 'mzml': Raw LC-MS data (requires processing)
- 'nmr': NMR spectra
"""
import pandas as pd
if file_type == 'peak_table':
# Load peak table
data = pd.read_csv(file_path, index_col=0)
# Rows = samples, Columns = metabolites
return data
elif file_type == 'mzml':
# Process raw MS data (requires pymzml or similar)
# Peak detection, alignment, quantification
passMetabolite identification:
python
def identify_metabolites(feature_data, mass_list, rt_list=None):
"""
Match features to metabolite databases.
Uses accurate mass and retention time (if available).
Queries: HMDB, KEGG Compound, PubChem
"""
from tooluniverse import ToolUniverse
tu = ToolUniverse()
identified_metabolites = []
for i, mass in enumerate(mass_list):
# Query HMDB by mass (±5 ppm tolerance)
hmdb_result = tu.run_one_function({
"name": "hmdb_search_by_mass",
"arguments": {
"mass": mass,
"mass_tolerance": 0.005 # 5 ppm
}
})
if hmdb_result and 'data' in hmdb_result:
matches = hmdb_result['data']
# Rank by confidence
# (exact mass match, pathway context, etc.)
identified_metabolites.append({
'feature_id': i,
'metabolite_name': matches[0]['name'],
'hmdb_id': matches[0]['accession'],
'formula': matches[0]['chemical_formula'],
'confidence': calculate_confidence(matches[0])
})
return identified_metabolitesConfidence scoring:
Level 1: Confirmed with authentic standard (MS + RT match)
Level 2: Probable structure (accurate mass + MS/MS)
Level 3: Tentative match (accurate mass only)
Level 4: Unknown metabolite目标:加载数据并从特征中鉴定代谢物。
支持的数据类型:
- 峰表:预处理后的代谢物丰度数据
- 原始谱图:LC-MS (.mzML, .mzXML)、GC-MS数据
- NMR谱图:1D/2D NMR数据
典型峰表格式:
Sample_ID | Glucose | Lactate | Glutamine | ... | Cholesterol
----------|---------|---------|-----------|-----|------------
Control_1 | 125000 | 45000 | 78000 | ... | 23000
Control_2 | 130000 | 43000 | 82000 | ... | 25000
Disease_1 | 85000 | 92000 | 45000 | ... | 45000数据加载:
python
def load_metabolomics_data(file_path, file_type='peak_table'):
"""
Load metabolomics data.
file_type options:
- 'peak_table': CSV/TSV with metabolites as columns
- 'mzml': Raw LC-MS data (requires processing)
- 'nmr': NMR spectra
"""
import pandas as pd
if file_type == 'peak_table':
# Load peak table
data = pd.read_csv(file_path, index_col=0)
# Rows = samples, Columns = metabolites
return data
elif file_type == 'mzml':
# Process raw MS data (requires pymzml or similar)
# Peak detection, alignment, quantification
pass代谢物鉴定:
python
def identify_metabolites(feature_data, mass_list, rt_list=None):
"""
Match features to metabolite databases.
Uses accurate mass and retention time (if available).
Queries: HMDB, KEGG Compound, PubChem
"""
from tooluniverse import ToolUniverse
tu = ToolUniverse()
identified_metabolites = []
for i, mass in enumerate(mass_list):
# Query HMDB by mass (±5 ppm tolerance)
hmdb_result = tu.run_one_function({
"name": "hmdb_search_by_mass",
"arguments": {
"mass": mass,
"mass_tolerance": 0.005 # 5 ppm
}
})
if hmdb_result and 'data' in hmdb_result:
matches = hmdb_result['data']
# Rank by confidence
# (exact mass match, pathway context, etc.)
identified_metabolites.append({
'feature_id': i,
'metabolite_name': matches[0]['name'],
'hmdb_id': matches[0]['accession'],
'formula': matches[0]['chemical_formula'],
'confidence': calculate_confidence(matches[0])
})
return identified_metabolites置信度评分:
Level 1: Confirmed with authentic standard (MS + RT match)
Level 2: Probable structure (accurate mass + MS/MS)
Level 3: Tentative match (accurate mass only)
Level 4: Unknown metabolitePhase 2: Quality Control & Filtering
阶段2:质量控制与过滤
Objective: Remove low-quality features and background noise.
Quality control metrics:
python
def metabolomics_qc(data, sample_metadata):
"""
Quality control for metabolomics data.
QC metrics:
- Coefficient of variation (CV) in QC samples
- Blank ratios (signal in samples vs blanks)
- Missing values per metabolite
- Total ion current per sample
"""
# 1. CV in QC samples (should be < 30%)
qc_samples = sample_metadata['sample_type'] == 'QC'
qc_data = data[qc_samples]
cv_per_metabolite = qc_data.std() / qc_data.mean()
high_cv = cv_per_metabolite > 0.3
print(f"Metabolites with CV > 30%: {high_cv.sum()}")
# 2. Blank subtraction
blank_samples = sample_metadata['sample_type'] == 'Blank'
blank_data = data[blank_samples]
# Calculate blank ratios
blank_means = blank_data.mean()
sample_means = data[~blank_samples & ~qc_samples].mean()
blank_ratio = sample_means / blank_means
# Filter: keep metabolites with sample/blank > 3
keep_metabolites = blank_ratio > 3
# 3. Missing values (remove if >50% missing)
missing_per_metabolite = (data == 0).sum() / data.shape[0]
keep_metabolites &= (missing_per_metabolite < 0.5)
# Filter data
filtered_data = data.loc[:, keep_metabolites]
return filtered_data目标:移除低质量特征及背景噪声。
质量控制指标:
python
def metabolomics_qc(data, sample_metadata):
"""
Quality control for metabolomics data.
QC metrics:
- Coefficient of variation (CV) in QC samples
- Blank ratios (signal in samples vs blanks)
- Missing values per metabolite
- Total ion current per sample
"""
# 1. CV in QC samples (should be < 30%)
qc_samples = sample_metadata['sample_type'] == 'QC'
qc_data = data[qc_samples]
cv_per_metabolite = qc_data.std() / qc_data.mean()
high_cv = cv_per_metabolite > 0.3
print(f"Metabolites with CV > 30%: {high_cv.sum()}")
# 2. Blank subtraction
blank_samples = sample_metadata['sample_type'] == 'Blank'
blank_data = data[blank_samples]
# Calculate blank ratios
blank_means = blank_data.mean()
sample_means = data[~blank_samples & ~qc_samples].mean()
blank_ratio = sample_means / blank_means
# Filter: keep metabolites with sample/blank > 3
keep_metabolites = blank_ratio > 3
# 3. Missing values (remove if >50% missing)
missing_per_metabolite = (data == 0).sum() / data.shape[0]
keep_metabolites &= (missing_per_metabolite < 0.5)
# Filter data
filtered_data = data.loc[:, keep_metabolites]
return filtered_dataPhase 3: Normalization
阶段3:归一化
Objective: Account for technical variation and enable fair comparison.
Normalization methods:
1. Total Ion Current (TIC):
python
def normalize_tic(data):
"""
Normalize by total ion current.
Assumes total metabolite abundance is similar across samples.
"""
tic = data.sum(axis=1)
median_tic = tic.median()
norm_factors = median_tic / tic
normalized = data.multiply(norm_factors, axis=0)
return normalized2. Probabilistic Quotient Normalization (PQN):
python
def normalize_pqn(data, reference_sample=None):
"""
Probabilistic quotient normalization.
More robust than TIC to large metabolite changes.
"""
import numpy as np
# Use median sample as reference
if reference_sample is None:
reference = data.median(axis=0)
else:
reference = data.loc[reference_sample]
# Calculate quotients
quotients = data.div(reference, axis=1)
# Median quotient per sample
norm_factors = quotients.median(axis=1)
# Normalize
normalized = data.div(norm_factors, axis=0)
return normalized3. Internal Standard Normalization:
python
def normalize_internal_standard(data, is_metabolite):
"""
Normalize by spiked-in internal standard.
Most accurate if added before sample processing.
"""
is_abundance = data[is_metabolite]
norm_factors = is_abundance.median() / is_abundance
normalized = data.multiply(norm_factors, axis=0)
# Remove internal standard from data
normalized = normalized.drop(columns=[is_metabolite])
return normalizedTransformation:
python
def transform_data(data, method='log'):
"""
Transform metabolite abundances.
Methods:
- 'log': log2 transform (stabilize variance)
- 'pareto': Pareto scaling (mean-center, divide by sqrt(std))
- 'auto': Auto-scaling (mean-center, divide by std)
"""
import numpy as np
if method == 'log':
# Add small constant to avoid log(0)
transformed = np.log2(data + 1)
elif method == 'pareto':
# Pareto scaling (common in metabolomics)
mean = data.mean(axis=0)
std = data.std(axis=0)
transformed = (data - mean) / np.sqrt(std)
elif method == 'auto':
# Auto-scaling (z-score)
mean = data.mean(axis=0)
std = data.std(axis=0)
transformed = (data - mean) / std
return transformed目标:消除技术差异,实现样本间的公平比较。
归一化方法:
1. 总离子流归一化(TIC):
python
def normalize_tic(data):
"""
Normalize by total ion current.
Assumes total metabolite abundance is similar across samples.
"""
tic = data.sum(axis=1)
median_tic = tic.median()
norm_factors = median_tic / tic
normalized = data.multiply(norm_factors, axis=0)
return normalized2. 概率商归一化(PQN):
python
def normalize_pqn(data, reference_sample=None):
"""
Probabilistic quotient normalization.
More robust than TIC to large metabolite changes.
"""
import numpy as np
# Use median sample as reference
if reference_sample is None:
reference = data.median(axis=0)
else:
reference = data.loc[reference_sample]
# Calculate quotients
quotients = data.div(reference, axis=1)
# Median quotient per sample
norm_factors = quotients.median(axis=1)
# Normalize
normalized = data.div(norm_factors, axis=0)
return normalized3. 内标归一化:
python
def normalize_internal_standard(data, is_metabolite):
"""
Normalize by spiked-in internal standard.
Most accurate if added before sample processing.
"""
is_abundance = data[is_metabolite]
norm_factors = is_abundance.median() / is_abundance
normalized = data.multiply(norm_factors, axis=0)
# Remove internal standard from data
normalized = normalized.drop(columns=[is_metabolite])
return normalized数据转换:
python
def transform_data(data, method='log'):
"""
Transform metabolite abundances.
Methods:
- 'log': log2 transform (stabilize variance)
- 'pareto': Pareto scaling (mean-center, divide by sqrt(std))
- 'auto': Auto-scaling (mean-center, divide by std)
"""
import numpy as np
if method == 'log':
# Add small constant to avoid log(0)
transformed = np.log2(data + 1)
elif method == 'pareto':
# Pareto scaling (common in metabolomics)
mean = data.mean(axis=0)
std = data.std(axis=0)
transformed = (data - mean) / np.sqrt(std)
elif method == 'auto':
# Auto-scaling (z-score)
mean = data.mean(axis=0)
std = data.std(axis=0)
transformed = (data - mean) / std
return transformedPhase 4: Exploratory Analysis
阶段4:探索性分析
Objective: Visualize data structure and detect outliers.
PCA:
python
def perform_pca_metabolomics(data, sample_groups):
"""
Principal component analysis for sample clustering.
"""
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)
# Plot
plt.figure(figsize=(8, 6))
for group in sample_groups.unique():
mask = sample_groups == group
plt.scatter(pca_result[mask, 0], pca_result[mask, 1], label=group)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.legend()
plt.title('PCA - Metabolomics Data')PLS-DA (Partial Least Squares Discriminant Analysis):
python
def plsda_analysis(X, y, n_components=2):
"""
PLS-DA for supervised dimensionality reduction.
Better separation than PCA for classification tasks.
"""
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelEncoder
# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# PLS
pls = PLSRegression(n_components=n_components)
X_pls = pls.fit_transform(X, y_encoded)[0]
# Plot
plt.scatter(X_pls[:, 0], X_pls[:, 1], c=y_encoded)
plt.xlabel('PLS Component 1')
plt.ylabel('PLS Component 2')
plt.title('PLS-DA')
return X_pls目标:可视化数据结构,检测离群样本。
主成分分析(PCA):
python
def perform_pca_metabolomics(data, sample_groups):
"""
Principal component analysis for sample clustering.
"""
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)
# Plot
plt.figure(figsize=(8, 6))
for group in sample_groups.unique():
mask = sample_groups == group
plt.scatter(pca_result[mask, 0], pca_result[mask, 1], label=group)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.legend()
plt.title('PCA - Metabolomics Data')偏最小二乘判别分析(PLS-DA):
python
def plsda_analysis(X, y, n_components=2):
"""
PLS-DA for supervised dimensionality reduction.
Better separation than PCA for classification tasks.
"""
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelEncoder
# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# PLS
pls = PLSRegression(n_components=n_components)
X_pls = pls.fit_transform(X, y_encoded)[0]
# Plot
plt.scatter(X_pls[:, 0], X_pls[:, 1], c=y_encoded)
plt.xlabel('PLS Component 1')
plt.ylabel('PLS Component 2')
plt.title('PLS-DA')
return X_plsPhase 5: Differential Metabolite Analysis
阶段5:差异代谢物分析
Objective: Identify metabolites with significant abundance changes.
Statistical testing:
python
def differential_metabolites(data, group1_samples, group2_samples):
"""
Identify differential metabolites between two groups.
"""
from scipy import stats
import numpy as np
results = []
for metabolite in data.columns:
# Extract abundances
group1 = data.loc[group1_samples, metabolite]
group2 = data.loc[group2_samples, metabolite]
# Statistics
mean1 = group1.mean()
mean2 = group2.mean()
fold_change = mean2 / mean1
log2fc = np.log2(fold_change)
# t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
results.append({
'metabolite': metabolite,
'fold_change': fold_change,
'log2FC': log2fc,
'mean_group1': mean1,
'mean_group2': mean2,
'p_value': p_value,
't_statistic': t_stat
})
results_df = pd.DataFrame(results)
# Multiple testing correction
from statsmodels.stats.multitest import multipletests
results_df['adj_p_value'] = multipletests(results_df['p_value'], method='fdr_bh')[1]
# Significance
results_df['significant'] = (
(results_df['adj_p_value'] < 0.05) &
(np.abs(results_df['log2FC']) > 1.0)
)
return results_dfVolcano plot:
python
def plot_metabolite_volcano(de_results):
"""Visualize differential metabolite results."""
plt.figure(figsize=(8, 6))
# Non-significant
non_sig = de_results[~de_results['significant']]
plt.scatter(non_sig['log2FC'], -np.log10(non_sig['p_value']),
c='gray', alpha=0.5, s=20)
# Significant
sig = de_results[de_results['significant']]
plt.scatter(sig['log2FC'], -np.log10(sig['p_value']),
c='red', alpha=0.7, s=30)
plt.axhline(-np.log10(0.05), color='blue', linestyle='--')
plt.axvline(-1, color='blue', linestyle='--')
plt.axvline(1, color='blue', linestyle='--')
plt.xlabel('log2 Fold Change')
plt.ylabel('-log10(p-value)')
plt.title('Differential Metabolites')目标:鉴定丰度显著变化的代谢物。
统计检验:
python
def differential_metabolites(data, group1_samples, group2_samples):
"""
Identify differential metabolites between two groups.
"""
from scipy import stats
import numpy as np
results = []
for metabolite in data.columns:
# Extract abundances
group1 = data.loc[group1_samples, metabolite]
group2 = data.loc[group2_samples, metabolite]
# Statistics
mean1 = group1.mean()
mean2 = group2.mean()
fold_change = mean2 / mean1
log2fc = np.log2(fold_change)
# t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
results.append({
'metabolite': metabolite,
'fold_change': fold_change,
'log2FC': log2fc,
'mean_group1': mean1,
'mean_group2': mean2,
'p_value': p_value,
't_statistic': t_stat
})
results_df = pd.DataFrame(results)
# Multiple testing correction
from statsmodels.stats.multitest import multipletests
results_df['adj_p_value'] = multipletests(results_df['p_value'], method='fdr_bh')[1]
# Significance
results_df['significant'] = (
(results_df['adj_p_value'] < 0.05) &
(np.abs(results_df['log2FC']) > 1.0)
)
return results_df火山图:
python
def plot_metabolite_volcano(de_results):
"""Visualize differential metabolite results."""
plt.figure(figsize=(8, 6))
# Non-significant
non_sig = de_results[~de_results['significant']]
plt.scatter(non_sig['log2FC'], -np.log10(non_sig['p_value']),
c='gray', alpha=0.5, s=20)
# Significant
sig = de_results[de_results['significant']]
plt.scatter(sig['log2FC'], -np.log10(sig['p_value']),
c='red', alpha=0.7, s=30)
plt.axhline(-np.log10(0.05), color='blue', linestyle='--')
plt.axvline(-1, color='blue', linestyle='--')
plt.axvline(1, color='blue', linestyle='--')
plt.xlabel('log2 Fold Change')
plt.ylabel('-log10(p-value)')
plt.title('Differential Metabolites')Phase 6: Metabolic Pathway Analysis
阶段6:代谢通路分析
Objective: Interpret metabolite changes at pathway level.
Metabolite Set Enrichment Analysis (MSEA):
python
def pathway_enrichment_metabolites(metabolite_list, organism='human'):
"""
Perform pathway enrichment for differential metabolites.
Uses KEGG metabolic pathways.
"""
from tooluniverse import ToolUniverse
tu = ToolUniverse()
# Get KEGG compound IDs for metabolites
kegg_ids = []
for metabolite in metabolite_list:
# Convert HMDB ID to KEGG Compound ID
result = tu.run_one_function({
"name": "kegg_find_compound",
"arguments": {"query": metabolite}
})
if result and 'data' in result:
kegg_ids.append(result['data'][0]['entry_id'])
# Pathway enrichment
enrichment = tu.run_one_function({
"name": "kegg_enrich_pathway",
"arguments": {
"compound_list": ",".join(kegg_ids),
"organism": organism
}
})
return enrichmentPathway topology analysis:
python
def pathway_topology_analysis(metabolites, pathway_id):
"""
Analyze pathway dysregulation considering topology.
Metabolites at key pathway positions (hubs, bottlenecks)
have more impact than peripheral metabolites.
"""
# Load pathway structure from KEGG
# Calculate impact score based on:
# - Betweenness centrality (bottleneck metabolites)
# - Degree centrality (hub metabolites)
# - Pathway position (early vs late)
pass目标:从通路层面解读代谢物变化。
代谢物集富集分析(MSEA):
python
def pathway_enrichment_metabolites(metabolite_list, organism='human'):
"""
Perform pathway enrichment for differential metabolites.
Uses KEGG metabolic pathways.
"""
from tooluniverse import ToolUniverse
tu = ToolUniverse()
# Get KEGG compound IDs for metabolites
kegg_ids = []
for metabolite in metabolite_list:
# Convert HMDB ID to KEGG Compound ID
result = tu.run_one_function({
"name": "kegg_find_compound",
"arguments": {"query": metabolite}
})
if result and 'data' in result:
kegg_ids.append(result['data'][0]['entry_id'])
# Pathway enrichment
enrichment = tu.run_one_function({
"name": "kegg_enrich_pathway",
"arguments": {
"compound_list": ",".join(kegg_ids),
"organism": organism
}
})
return enrichment通路拓扑分析:
python
def pathway_topology_analysis(metabolites, pathway_id):
"""
Analyze pathway dysregulation considering topology.
Metabolites at key pathway positions (hubs, bottlenecks)
have more impact than peripheral metabolites.
"""
# Load pathway structure from KEGG
# Calculate impact score based on:
# - Betweenness centrality (bottleneck metabolites)
# - Degree centrality (hub metabolites)
# - Pathway position (early vs late)
passPhase 7: Multi-Omics Integration
阶段7:多组学整合
Objective: Integrate metabolomics with transcriptomics/proteomics.
Metabolite-enzyme correlation:
python
def correlate_metabolite_enzyme(metabolite_data, enzyme_expression):
"""
Correlate metabolite levels with enzyme expression.
Expected correlations:
- Substrate + enzyme → negative correlation (consumption)
- Product + enzyme → positive correlation (production)
"""
from scipy.stats import spearmanr
# For each metabolite-enzyme pair
correlations = {}
for metabolite in metabolite_data.columns:
# Find enzymes that produce/consume this metabolite
enzymes = find_metabolite_enzymes(metabolite)
for enzyme in enzymes:
if enzyme in enzyme_expression.index:
met_levels = metabolite_data[metabolite]
enz_expr = enzyme_expression.loc[enzyme]
r, p = spearmanr(met_levels, enz_expr)
correlations[f'{metabolite}_{enzyme}'] = {
'r': r,
'p': p,
'relationship': 'product' if r > 0 else 'substrate'
}
return correlationsPathway-level integration:
python
def integrate_omics_pathway(metabolite_fc, gene_fc, pathway_id):
"""
Integrate metabolite and gene fold changes at pathway level.
For each reaction:
- Check if metabolites are changed
- Check if enzymes are changed
- Score pathway dysregulation (combined evidence)
"""
# Load pathway reactions
# For each reaction:
# - Metabolite substrate/product changes
# - Enzyme expression changes
# - Concordance score
pathway_score = calculate_pathway_dysregulation(
metabolite_fc, gene_fc, pathway_id
)
return pathway_score目标:将代谢组学数据与转录组学/蛋白质组学数据整合。
代谢物-酶相关性分析:
python
def correlate_metabolite_enzyme(metabolite_data, enzyme_expression):
"""
Correlate metabolite levels with enzyme expression.
Expected correlations:
- Substrate + enzyme → negative correlation (consumption)
- Product + enzyme → positive correlation (production)
"""
from scipy.stats import spearmanr
# For each metabolite-enzyme pair
correlations = {}
for metabolite in metabolite_data.columns:
# Find enzymes that produce/consume this metabolite
enzymes = find_metabolite_enzymes(metabolite)
for enzyme in enzymes:
if enzyme in enzyme_expression.index:
met_levels = metabolite_data[metabolite]
enz_expr = enzyme_expression.loc[enzyme]
r, p = spearmanr(met_levels, enz_expr)
correlations[f'{metabolite}_{enzyme}'] = {
'r': r,
'p': p,
'relationship': 'product' if r > 0 else 'substrate'
}
return correlations通路层面整合:
python
def integrate_omics_pathway(metabolite_fc, gene_fc, pathway_id):
"""
Integrate metabolite and gene fold changes at pathway level.
For each reaction:
- Check if metabolites are changed
- Check if enzymes are changed
- Score pathway dysregulation (combined evidence)
"""
# Load pathway reactions
# For each reaction:
# - Metabolite substrate/product changes
# - Enzyme expression changes
# - Concordance score
pathway_score = calculate_pathway_dysregulation(
metabolite_fc, gene_fc, pathway_id
)
return pathway_scorePhase 8: Report Generation
阶段8:报告生成
Generate comprehensive metabolomics report:
markdown
undefined生成全面的代谢组学分析报告:
markdown
undefinedMetabolomics Analysis Report
Metabolomics Analysis Report
Dataset Summary
Dataset Summary
- Platform: LC-MS/MS (Orbitrap)
- Method: Untargeted metabolomics
- Samples: 40 (20 disease, 20 control)
- Metabolites Identified: 324 (Level 1/2 confidence)
- Metabolites Quantified: 298 (after QC)
- Platform: LC-MS/MS (Orbitrap)
- Method: Untargeted metabolomics
- Samples: 40 (20 disease, 20 control)
- Metabolites Identified: 324 (Level 1/2 confidence)
- Metabolites Quantified: 298 (after QC)
Quality Control
Quality Control
- CV in QC samples: 18% median (acceptable: <30%)
- Blank ratios: All metabolites > 3x blank signal
- Missing values: 8% average per metabolite
- Internal standard: Recovery 95-105% across samples
- CV in QC samples: 18% median (acceptable: <30%)
- Blank ratios: All metabolites > 3x blank signal
- Missing values: 8% average per metabolite
- Internal standard: Recovery 95-105% across samples
Normalization
Normalization
- Method: Probabilistic Quotient Normalization (PQN)
- Transformation: log2
- Batch correction: Not required (single batch)
- Method: Probabilistic Quotient Normalization (PQN)
- Transformation: log2
- Batch correction: Not required (single batch)
Exploratory Analysis
Exploratory Analysis
- PCA: Clear separation between groups (PC1: 28%, PC2: 18%)
- PLS-DA: Excellent discrimination (R2=0.89, Q2=0.75)
- Outliers: 1 sample removed (technical failure)
- PCA: Clear separation between groups (PC1: 28%, PC2: 18%)
- PLS-DA: Excellent discrimination (R2=0.89, Q2=0.75)
- Outliers: 1 sample removed (technical failure)
Differential Metabolites
Differential Metabolites
- Significant metabolites: 87 (adj. p < 0.05, |log2FC| > 1)
- Increased: 52 metabolites
- Decreased: 35 metabolites
- Significant metabolites: 87 (adj. p < 0.05, |log2FC| > 1)
- Increased: 52 metabolites
- Decreased: 35 metabolites
Top Increased Metabolites
Top Increased Metabolites
- Lactate (log2FC=3.2, p=1e-12) - Glycolysis
- Glutamine (log2FC=2.8, p=1e-10) - Amino acid metabolism
- Palmitate (log2FC=2.5, p=1e-9) - Fatty acid synthesis
- Lactate (log2FC=3.2, p=1e-12) - Glycolysis
- Glutamine (log2FC=2.8, p=1e-10) - Amino acid metabolism
- Palmitate (log2FC=2.5, p=1e-9) - Fatty acid synthesis
Top Decreased Metabolites
Top Decreased Metabolites
- Citrate (log2FC=-2.9, p=1e-11) - TCA cycle
- ATP (log2FC=-2.3, p=1e-9) - Energy metabolism
- NAD+ (log2FC=-2.1, p=1e-8) - Redox balance
- Citrate (log2FC=-2.9, p=1e-11) - TCA cycle
- ATP (log2FC=-2.3, p=1e-9) - Energy metabolism
- NAD+ (log2FC=-2.1, p=1e-8) - Redox balance
Pathway Enrichment
Pathway Enrichment
Top Dysregulated Pathways
Top Dysregulated Pathways
- Glycolysis/Gluconeogenesis (p=1e-15)
- 12 metabolites: glucose, pyruvate, lactate, etc.
- Direction: Increased flux to lactate (Warburg effect)
- TCA Cycle (p=1e-12)
- 8 metabolites: citrate, succinate, malate, etc.
- Direction: Decreased activity
- Glutaminolysis (p=1e-10)
- 6 metabolites: glutamine, glutamate, α-KG, etc.
- Direction: Increased glutamine consumption
- Glycolysis/Gluconeogenesis (p=1e-15)
- 12 metabolites: glucose, pyruvate, lactate, etc.
- Direction: Increased flux to lactate (Warburg effect)
- TCA Cycle (p=1e-12)
- 8 metabolites: citrate, succinate, malate, etc.
- Direction: Decreased activity
- Glutaminolysis (p=1e-10)
- 6 metabolites: glutamine, glutamate, α-KG, etc.
- Direction: Increased glutamine consumption
Multi-Omics Integration
Multi-Omics Integration
Metabolite-Enzyme Correlations
Metabolite-Enzyme Correlations
- LDHA (lactate dehydrogenase)
- Expression: 3.5-fold increased (RNA + protein)
- Lactate: 3.2-fold increased
- Correlation: r=0.85 (p<0.001) - Concordant upregulation
- IDH1 (isocitrate dehydrogenase)
- Expression: 2.1-fold decreased
- Citrate: 2.9-fold decreased
- Correlation: r=0.78 (p<0.001) - TCA cycle suppression
- LDHA (lactate dehydrogenase)
- Expression: 3.5-fold increased (RNA + protein)
- Lactate: 3.2-fold increased
- Correlation: r=0.85 (p<0.001) - Concordant upregulation
- IDH1 (isocitrate dehydrogenase)
- Expression: 2.1-fold decreased
- Citrate: 2.9-fold decreased
- Correlation: r=0.78 (p<0.001) - TCA cycle suppression
Metabolic Phenotype
Metabolic Phenotype
Integration with RNA-seq and proteomics reveals:
- Warburg effect: Shift from oxidative to glycolytic metabolism
- Glutamine addiction: Increased glutaminolysis for anaplerosis
- Redox imbalance: Decreased NAD+/NADH ratio, oxidative stress
Integration with RNA-seq and proteomics reveals:
- Warburg effect: Shift from oxidative to glycolytic metabolism
- Glutamine addiction: Increased glutaminolysis for anaplerosis
- Redox imbalance: Decreased NAD+/NADH ratio, oxidative stress
Biomarker Discovery
Biomarker Discovery
Top 10 Metabolites for Classification
Top 10 Metabolites for Classification
Random Forest model (10-fold CV):
- AUC: 0.96 ± 0.03
- Accuracy: 92%
Biomarker Panel:
- Lactate
- Glutamine
- Citrate
- ATP
- Palmitate
- Pyruvate
- Succinate
- NAD+
- α-ketoglutarate
- Glucose-6-phosphate
Random Forest model (10-fold CV):
- AUC: 0.96 ± 0.03
- Accuracy: 92%
Biomarker Panel:
- Lactate
- Glutamine
- Citrate
- ATP
- Palmitate
- Pyruvate
- Succinate
- NAD+
- α-ketoglutarate
- Glucose-6-phosphate
Biological Interpretation
Biological Interpretation
Metabolomics reveals fundamental metabolic reprogramming in disease state:
-
Glycolytic switch: Increased glycolysis with lactate accumulation despite oxygen availability (Warburg effect), driven by LDHA upregulation.
-
TCA cycle suppression: Decreased citrate and TCA intermediates, consistent with IDH1 downregulation. Shunts carbon to biosynthesis.
-
Glutamine dependence: Elevated glutamine consumption and glutaminolysis provides alternative carbon source for anaplerosis and NADPH for biosynthesis.
-
Biosynthetic activation: Increased palmitate indicates active fatty acid synthesis, supporting membrane production for proliferation.
-
Energy stress: Despite active glycolysis, ATP levels are decreased, suggesting high energy demand outpacing production.
This metabolic signature is characteristic of proliferative, biosynthetically active cells typical of cancer or activated immune cells.
Metabolomics reveals fundamental metabolic reprogramming in disease state:
-
Glycolytic switch: Increased glycolysis with lactate accumulation despite oxygen availability (Warburg effect), driven by LDHA upregulation.
-
TCA cycle suppression: Decreased citrate and TCA intermediates, consistent with IDH1 downregulation. Shunts carbon to biosynthesis.
-
Glutamine dependence: Elevated glutamine consumption and glutaminolysis provides alternative carbon source for anaplerosis and NADPH for biosynthesis.
-
Biosynthetic activation: Increased palmitate indicates active fatty acid synthesis, supporting membrane production for proliferation.
-
Energy stress: Despite active glycolysis, ATP levels are decreased, suggesting high energy demand outpacing production.
This metabolic signature is characteristic of proliferative, biosynthetically active cells typical of cancer or activated immune cells.
Clinical Relevance
Clinical Relevance
- Therapeutic targets: LDHA inhibitors, glutaminase inhibitors to disrupt metabolic dependencies
- Biomarkers: Lactate/citrate ratio as metabolic activity marker
- Drug response: Metabolic phenotype may predict sensitivity to metabolic inhibitors
---- Therapeutic targets: LDHA inhibitors, glutaminase inhibitors to disrupt metabolic dependencies
- Biomarkers: Lactate/citrate ratio as metabolic activity marker
- Drug response: Metabolic phenotype may predict sensitivity to metabolic inhibitors
---Integration with ToolUniverse
与ToolUniverse的整合
| Skill | Used For | Phase |
|---|---|---|
| Pathway enrichment | Phase 6 |
| Enzyme expression for integration | Phase 7 |
| Protein levels for integration | Phase 7 |
| Comprehensive integration | Phase 7 |
| 技能 | 用途 | 阶段 |
|---|---|---|
| 通路富集分析 | 阶段6 |
| 用于整合的酶表达分析 | 阶段7 |
| 用于整合的蛋白质水平分析 | 阶段7 |
| 全面多组学整合 | 阶段7 |
Quantified Minimums
最低要求
| Component | Requirement |
|---|---|
| Metabolites | At least 50 identified metabolites |
| Replicates | At least 3 per condition |
| QC | CV < 30% in QC samples, blank subtraction |
| Statistical test | t-test or Wilcoxon with FDR correction |
| Pathway analysis | MSEA with KEGG or Reactome |
| Report | QC, differential metabolites, pathways, visualizations |
| 组件 | 要求 |
|---|---|
| 代谢物 | 至少鉴定50种代谢物 |
| 重复样本 | 每个条件至少3个重复 |
| 质量控制 | QC样本CV < 30%,完成空白扣除 |
| 统计检验 | 采用t检验或Wilcoxon检验并进行FDR校正 |
| 通路分析 | 使用KEGG或Reactome进行MSEA分析 |
| 报告 | 包含质量控制、差异代谢物、通路分析及可视化内容 |
Limitations
局限性
- Identification: Many features remain unidentified (Level 4)
- Coverage: Cannot detect all metabolites (depends on method)
- Quantification: Relative abundance (not absolute concentration without standards)
- Isomers: Difficult to distinguish structural isomers
- Ion suppression: Matrix effects can affect quantification
- Dynamic range: Limited compared to targeted methods
- 鉴定能力:仍有大量特征无法鉴定(Level 4)
- 覆盖范围:无法检测所有代谢物(取决于实验方法)
- 定量方式:相对丰度定量(无标准品时无法获得绝对浓度)
- 异构体区分:难以区分结构异构体
- 离子抑制:基质效应会影响定量准确性
- 动态范围:相较于靶向方法,动态范围有限
References
参考文献
Methods:
- MetaboAnalyst: https://doi.org/10.1093/nar/gkab382
- XCMS: https://doi.org/10.1021/ac051437y
- MSEA: https://doi.org/10.1186/1471-2105-11-395
Databases:
- HMDB: https://hmdb.ca
- KEGG Compound: https://www.genome.jp/kegg/compound/
- Reactome: https://reactome.org
方法:
- MetaboAnalyst: https://doi.org/10.1093/nar/gkab382
- XCMS: https://doi.org/10.1021/ac051437y
- MSEA: https://doi.org/10.1186/1471-2105-11-395
数据库:
- HMDB: https://hmdb.ca
- KEGG Compound: https://www.genome.jp/kegg/compound/
- Reactome: https://reactome.org
undefined