tooluniverse-proteomics-analysis
Proteomics Analysis
Comprehensive analysis of mass spectrometry-based proteomics data from protein identification through quantification, differential expression, post-translational modifications, and systems-level interpretation.
When to Use This Skill
Triggers:
- User has proteomics data (MS output files)
- Questions about protein abundance or expression
- Differential protein expression analysis requests
- PTM analysis (phosphorylation, acetylation, ubiquitination)
- Protein-RNA correlation analysis
- Multi-omics integration involving proteomics
- Protein complex or interaction analysis
- Proteomics biomarker discovery
Example Questions This Skill Solves:
- "Analyze this MaxQuant output for differential protein expression"
- "Which proteins are significantly upregulated in disease vs control?"
- "Correlate protein abundance with mRNA expression"
- "What post-translational modifications change between conditions?"
- "Identify protein complexes in my co-IP MS data"
- "Which pathways are enriched in differentially expressed proteins?"
- "Find protein biomarkers for disease classification"
- "Compare protein and RNA levels to identify translation-regulated genes"
Core Capabilities
| Capability | Description |
|---|---|
| Data Import | MaxQuant, Spectronaut, DIA-NN, Proteome Discoverer, FragPipe outputs |
| Quality Control | Missing value analysis, intensity distributions, sample clustering |
| Normalization | Median, quantile, TMM, VSN normalization methods |
| Imputation | MinProb, KNN, QRILC for missing values |
| Differential Expression | Limma, DEP, MSstats for statistical testing |
| PTM Analysis | Phospho-site localization, PTM enrichment, kinase prediction |
| Protein-RNA Integration | Correlation analysis, translation efficiency |
| Pathway Enrichment | Over-representation and GSEA for protein sets |
| PPI Analysis | Protein complex detection, interaction networks via STRING/IntAct |
| Reporting | Comprehensive reports with volcano plots, heatmaps, pathway diagrams |
Workflow Overview
Input: MS Proteomics Data
|
v
Phase 1: Data Import & QC
|-- Load MaxQuant/Spectronaut/DIA-NN output
|-- Parse protein groups, intensities, modifications
|-- Quality control plots (missing values, intensity distributions)
|-- Sample correlation and PCA
|
v
Phase 2: Preprocessing
|-- Filter low-confidence proteins
|-- Handle missing values (imputation or filtering)
|-- Log-transform intensities
|-- Normalize across samples
|
v
Phase 3: Differential Expression Analysis
|-- Statistical testing (limma, t-test, ANOVA)
|-- Multiple testing correction (BH, Bonferroni)
|-- Fold change calculation
|-- Significance thresholds (adjusted p < 0.05, |log2FC| > 1)
|
v
Phase 4: PTM Analysis (if applicable)
|-- Identify modified peptides
|-- Localization probability filtering
|-- PTM site quantification
|-- Kinase-substrate prediction
|-- PTM enrichment analysis
|
v
Phase 5: Functional Enrichment
|-- Gene Ontology enrichment
|-- KEGG/Reactome pathway enrichment
|-- Protein complex enrichment (CORUM)
|-- Tissue-specific enrichment
|
v
Phase 6: Protein-Protein Interactions
|-- Query STRING for interaction networks
|-- Identify protein complexes
|-- Network clustering and modules
|-- Hub protein identification
|
v
Phase 7: Multi-Omics Integration (optional)
|-- Correlate with RNA-seq data
|-- Identify translation-regulated proteins
|-- Compare with variant/CNV data
|-- Integrate with metabolomics
|
v
Phase 8: Generate Report
|-- Summary statistics
|-- Volcano plots and heatmaps
|-- Pathway diagrams
|-- Protein network visualizations
|-- Multi-omics integration plots
Phase Details
Phase 1: Data Import & Quality Control
Objective: Load proteomics data and assess data quality.
Supported input formats:
MaxQuant (most common):
- `proteinGroups.txt` - Protein-level quantification
- `evidence.txt` - Peptide-level data
- `Phospho (STY)Sites.txt` - Phosphorylation sites
- `modificationSpecificPeptides.txt` - Other PTMs
Spectronaut:
- `*_Report.tsv` - Protein/peptide quantification
- DIA-based quantification
DIA-NN:
- `report.tsv` - Protein groups
- `report.pr_matrix.tsv` - Precursor matrix (the protein-group matrix is `report.pg_matrix.tsv`)
Proteome Discoverer:
- `*_Proteins.txt`
- `*_PSMs.txt`
Data loading:
```python
import pandas as pd

def load_maxquant_proteins(protein_groups_file):
    """
    Load a MaxQuant proteinGroups.txt file.
    Returns:
    - DataFrame with proteins as rows, samples as columns
    - Metadata (protein IDs, gene names, peptide counts, sequence coverage)
    """
    # Read the tab-separated file
    df = pd.read_csv(protein_groups_file, sep='\t', low_memory=False)
    # Prefer LFQ columns and fall back to raw per-sample intensities.
    # Selecting both would produce duplicate sample names after renaming.
    lfq_cols = [col for col in df.columns if col.startswith('LFQ intensity ')]
    raw_cols = [col for col in df.columns if col.startswith('Intensity ')]
    intensity_cols = lfq_cols or raw_cols
    # Create intensity matrix with bare sample names as column labels
    intensity_matrix = df[intensity_cols].copy()
    intensity_matrix.columns = [
        col.replace('LFQ intensity ', '').replace('Intensity ', '')
        for col in intensity_cols
    ]
    # Add protein metadata
    metadata = df[['Protein IDs', 'Gene names', 'Fasta headers',
                   'Peptides', 'Sequence coverage [%]']].copy()
    return intensity_matrix, metadata
```
Quality Control:
- Missing value assessment:
```python
import numpy as np

def assess_missing_values(intensity_matrix):
    """
    Calculate the fraction of missing values per protein and per sample.
    MaxQuant reports undetected proteins as 0, so zeros count as missing.
    """
    missing = intensity_matrix.replace(0, np.nan).isna()
    # Per protein (fraction of samples without a value)
    missing_per_protein = missing.mean(axis=1)
    # Per sample (fraction of proteins without a value)
    missing_per_sample = missing.mean(axis=0)
    # Visualize (plot_missing_value_heatmap is a user-supplied plotting helper)
    plot_missing_value_heatmap(intensity_matrix)
    return missing_per_protein, missing_per_sample
```
- Intensity distribution:
```python
import matplotlib.pyplot as plt
import numpy as np

def plot_intensity_distributions(intensity_matrix):
    """
    Plot log10 intensity distributions per sample.
    Check for consistent distributions across samples.
    """
    # Zeros become NaN so they do not produce -inf after the log
    log_intensities = np.log10(intensity_matrix.replace(0, np.nan))
    # Boxplot per sample
    log_intensities.plot(kind='box')
    plt.ylabel('log10 Intensity')
    plt.title('Intensity Distribution per Sample')
    # Should see similar median and spread across samples
```
- Sample correlation:
```python
import numpy as np
import seaborn as sns

def plot_sample_correlation(intensity_matrix):
    """
    Calculate and visualize sample-sample correlation.
    Expect: High correlation within replicates, lower between conditions.
    """
    # Log-transform; zeros become NaN and are ignored pairwise by corr()
    log_data = np.log2(intensity_matrix.replace(0, np.nan))
    # Correlation matrix
    corr_matrix = log_data.corr(method='pearson')
    # Heatmap
    sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', vmin=0.8, vmax=1.0)
    return corr_matrix
```
- PCA:
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def perform_pca(intensity_matrix, sample_groups):
    """
    Principal component analysis for sample clustering.
    sample_groups: one group label per sample, in column order.
    """
    # Prepare data (log-transform, impute); PCA itself centers each protein
    log_data = np.log2(intensity_matrix.replace(0, np.nan))
    # Simple imputation with the global minimum value
    imputed = log_data.fillna(log_data.min().min())
    # PCA on a samples x proteins matrix
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(imputed.T)
    # Plot with group colors (labels are mapped to integer codes first)
    group_codes, _ = pd.factorize(pd.Series(sample_groups))
    plt.scatter(pca_result[:, 0], pca_result[:, 1], c=group_codes)
    plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
    plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
    return pca_result
```
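The missing-value check in this phase calls `plot_missing_value_heatmap`, which the text does not define. A minimal sketch of such a helper follows, assuming zeros mark missing values as in MaxQuant output. The function body and the `missingness_matrix` helper are illustrative assumptions, not part of the original skill.

```python
import numpy as np
import pandas as pd


def missingness_matrix(intensity_matrix):
    """Return a 0/1 matrix (1 = missing), proteins sorted by missing count."""
    missing = intensity_matrix.replace(0, np.nan).isna().astype(int)
    order = missing.sum(axis=1).sort_values(kind="stable").index
    return missing.loc[order]


def plot_missing_value_heatmap(intensity_matrix):
    """Draw the missingness matrix as a heatmap (hypothetical helper)."""
    # Imported lazily so the data step above works without matplotlib
    import matplotlib.pyplot as plt

    missing = missingness_matrix(intensity_matrix)
    fig, ax = plt.subplots(figsize=(6, 8))
    ax.imshow(missing.to_numpy(), aspect="auto", cmap="Greys",
              interpolation="nearest")
    ax.set_xticks(range(missing.shape[1]))
    ax.set_xticklabels(missing.columns, rotation=90)
    ax.set_xlabel("Sample")
    ax.set_ylabel("Proteins (sorted by number of missing values)")
    ax.set_title("Missing values (dark = missing)")
    return ax


# Toy example: 3 proteins x 3 samples, zeros mark missing intensities
toy = pd.DataFrame(
    {"ctrl_1": [1e6, 0, 5e5], "ctrl_2": [2e6, 0, 0], "dis_1": [1.5e6, 3e5, 4e5]},
    index=["P1", "P2", "P3"],
)
toy_missing = missingness_matrix(toy)
```

Sorting proteins by missing count makes block patterns visible, for example proteins absent from one whole condition, which hints at values missing not at random.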
Phase 2: Preprocessing & Normalization
Objective: Clean data and normalize across samples for fair comparison.
Filtering:
```python
def filter_proteins(intensity_matrix, metadata, min_valid=3):
    """
    Filter out low-confidence proteins.
    Criteria:
    - At least 2 peptides ('Peptides' column in metadata)
    - At least min_valid samples with detected intensity
    - Remove contaminants and reverse sequences
    """
    # Filter by peptide count
    valid_proteins = metadata['Peptides'] >= 2
    # Filter by detection in samples
    n_detected = (intensity_matrix > 0).sum(axis=1)
    valid_detection = n_detected >= min_valid
    # Remove contaminants and decoys (MaxQuant ID prefixes)
    is_contaminant = metadata['Protein IDs'].str.contains('CON__', na=False)
    is_reverse = metadata['Protein IDs'].str.contains('REV__', na=False)
    # Combined filter (both tables must share the same row index)
    keep = valid_proteins & valid_detection & ~is_contaminant & ~is_reverse
    return intensity_matrix.loc[keep], metadata.loc[keep]
```
Missing value imputation:
```python
import numpy as np
import pandas as pd

def impute_missing_values(log2_intensities, method='MinProb', seed=0):
    """
    Impute missing protein intensities.
    Expects log2-transformed intensities with missing values as NaN
    (convert MaxQuant zeros to NaN before taking the log); the shift and
    width below are in log2 units and are meaningless on the raw scale.
    Methods:
    - MinProb: Random draws below the minimum observed value (for MNAR)
    - KNN: K-nearest neighbors imputation
    - QRILC: Quantile regression-based imputation (not implemented here)
    """
    if method == 'MinProb':
        # Assume missing = low abundance (MNAR assumption)
        min_val = log2_intensities.min().min()
        width = 0.3  # Standard deviation of noise
        shift = 1.8  # Downshift from minimum
        # Draw one random low value per cell; use it only where data are missing
        rng = np.random.default_rng(seed)
        random_vals = pd.DataFrame(
            rng.normal(loc=min_val - shift, scale=width,
                       size=log2_intensities.shape),
            index=log2_intensities.index,
            columns=log2_intensities.columns
        )
        return log2_intensities.where(log2_intensities.notna(), random_vals)
    elif method == 'KNN':
        from sklearn.impute import KNNImputer
        imputer = KNNImputer(n_neighbors=5)
        return pd.DataFrame(
            imputer.fit_transform(log2_intensities),
            index=log2_intensities.index,
            columns=log2_intensities.columns
        )
    raise NotImplementedError(f"Imputation method '{method}' is not implemented")
```
Normalization:
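```python
import numpy as np
import pandas as pd

def normalize_intensities(intensity_matrix, method='median'):
    """
    Normalize protein intensities across samples.
    Methods:
    - median: Scale each sample to a common median (raw-scale intensities)
    - quantile: Quantile normalization (same distribution; needs a complete matrix)
    - TMM: Trimmed mean of M-values (from edgeR; not implemented here)
    - VSN: Variance-stabilizing normalization (not implemented here)
    """
    if method == 'median':
        # Median normalization; zeros are missing and are excluded from the median
        medians = intensity_matrix.replace(0, np.nan).median(axis=0)
        global_median = medians.median()
        norm_factors = global_median / medians
        normalized = intensity_matrix * norm_factors
        return normalized
    elif method == 'quantile':
        # Quantile normalization: every sample receives the mean sorted profile.
        # (sklearn's quantile_transform maps to a uniform distribution instead.)
        values = intensity_matrix.to_numpy(dtype=float)
        order = np.argsort(values, axis=0)
        reference = np.sort(values, axis=0).mean(axis=1)
        normalized = np.empty_like(values)
        for j in range(values.shape[1]):
            normalized[order[:, j], j] = reference
        return pd.DataFrame(
            normalized,
            index=intensity_matrix.index,
            columns=intensity_matrix.columns
        )
    raise NotImplementedError(f"Normalization method '{method}' is not implemented")
```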
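The workflow lists a log-transform step that has no code in this phase. The sketch below shows one way to do it, together with median centering on the log2 scale, where normalization becomes a subtraction rather than a multiplication. The helper names `log2_transform` and `median_center` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def log2_transform(intensity_matrix):
    """Convert raw intensities to log2 scale; zeros become NaN (missing)."""
    return np.log2(intensity_matrix.replace(0, np.nan))


def median_center(log2_matrix):
    """Shift each sample so every sample median equals the overall median."""
    sample_medians = log2_matrix.median(axis=0)  # NaN values are skipped
    return log2_matrix - sample_medians + sample_medians.median()


# Toy data: sample s2 is systematically brighter than s1
raw = pd.DataFrame(
    {"s1": [1024.0, 4096.0, 0.0, 256.0],
     "s2": [4096.0, 16384.0, 1024.0, 1024.0]},
    index=["P1", "P2", "P3", "P4"],
)
log2_data = log2_transform(raw)
centered = median_center(log2_data)
```

After centering, both samples share the same median while within-sample differences between proteins are unchanged.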
Phase 3: Differential Expression Analysis
Objective: Identify proteins with significant abundance changes between conditions.
Statistical testing (per-protein Welch's t-test as a simple stand-in for limma):
```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def differential_expression_limma(log2_intensities, group1_samples, group2_samples):
    """
    Differential expression with a per-protein Welch's t-test.
    This is a simple stand-in for limma: it does not apply limma's
    empirical Bayes variance moderation.
    Returns a DataFrame with:
    - log2 fold changes (group2 - group1)
    - p-values
    - adjusted p-values (BH)
    """
    results = []
    for protein in log2_intensities.index:
        # Extract intensities for each group, dropping missing values
        group1 = log2_intensities.loc[protein, group1_samples].dropna().astype(float)
        group2 = log2_intensities.loc[protein, group2_samples].dropna().astype(float)
        # Calculate statistics
        mean1 = group1.mean()
        mean2 = group2.mean()
        log2fc = mean2 - mean1
        # Welch's t-test; the p-value is NaN if a group has fewer than 2 values
        t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
        results.append({
            'protein': protein,
            'log2FC': log2fc,
            'mean_group1': mean1,
            'mean_group2': mean2,
            'p_value': p_value,
            't_statistic': t_stat
        })
    results_df = pd.DataFrame(results)
    # Multiple testing correction (Benjamini-Hochberg) on testable proteins only
    tested = results_df['p_value'].notna()
    results_df['adj_p_value'] = np.nan
    if tested.any():
        results_df.loc[tested, 'adj_p_value'] = multipletests(
            results_df.loc[tested, 'p_value'], method='fdr_bh')[1]
    # Classify significance
    results_df['significant'] = (
        (results_df['adj_p_value'] < 0.05) &
        (np.abs(results_df['log2FC']) > 1.0)
    )
    return results_df
```
Volcano plot:
```python
import matplotlib.pyplot as plt
import numpy as np

def plot_volcano(de_results, title='Volcano Plot'):
    """
    Visualize differential expression results.
    """
    plt.figure(figsize=(8, 6))
    # Non-significant
    non_sig = de_results[~de_results['significant']]
    plt.scatter(non_sig['log2FC'], -np.log10(non_sig['p_value']),
                c='gray', alpha=0.5, s=10)
    # Significant
    sig = de_results[de_results['significant']]
    plt.scatter(sig['log2FC'], -np.log10(sig['p_value']),
                c='red', alpha=0.7, s=20)
    # Thresholds (the y-axis shows unadjusted p-values)
    plt.axhline(-np.log10(0.05), color='blue', linestyle='--',
                label='p=0.05 (unadjusted)')
    plt.axvline(-1, color='blue', linestyle='--')
    plt.axvline(1, color='blue', linestyle='--', label='|log2FC|=1')
    plt.xlabel('log2 Fold Change')
    plt.ylabel('-log10(p-value)')
    plt.title(title)
    plt.legend()
```
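To make the multiple testing step concrete, the sketch below spells out the Benjamini-Hochberg adjustment that `multipletests(..., method='fdr_bh')` performs. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions; in practice the statsmodels call is the simpler choice.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


p_vals = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(p_vals)
```

With four tests, the two smallest p-values both adjust to 0.02 and the two largest to 0.04, so all four would pass a 0.05 FDR threshold here.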
Phase 4: PTM Analysis
Objective: Analyze post-translational modifications (phosphorylation, acetylation, etc.)
Phosphoproteomics workflow:
```python
import pandas as pd

def analyze_phosphosites(phospho_sites_file, intensity_matrix):
    """
    Prepare phosphorylation sites for differential analysis.
    Input: MaxQuant Phospho (STY)Sites.txt
    Output: Confidently localized sites with a 'site' label
    """
    # Load phospho data
    phospho = pd.read_csv(phospho_sites_file, sep='\t')
    # Filter by localization probability (copy to avoid modifying a view)
    phospho_confident = phospho[phospho['Localization prob'] > 0.75].copy()
    # Extract site information, e.g. GENE_S123
    phospho_confident['site'] = (
        phospho_confident['Gene names'] + '_' +
        phospho_confident['Amino acid'] +
        phospho_confident['Position'].astype(str)
    )
    # Quantification (similar to protein-level analysis)
    # ... perform differential analysis ...
    # The differential step is not shown here, so return the filtered sites
    return phospho_confident
```
Kinase-substrate prediction:
```python
def predict_kinases(phospho_sites):
    """
    Predict upstream kinases for phosphorylation sites.
    Uses ToolUniverse PhosphoSitePlus or KEA3 tools.
    """
    from tooluniverse import ToolUniverse
    tu = ToolUniverse()
    # For each significant phosphosite
    kinase_predictions = []
    for site in phospho_sites:
        # Query kinase-substrate databases
        # (would use actual ToolUniverse tool here)
        result = tu.run_one_function({
            "name": "phosphosite_plus_query",  # hypothetical
            "arguments": {"site": site}
        })
        kinase_predictions.append(result)
    return kinase_predictions
```
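The localization filter and site labelling described above can be tried on a small table before running it on real MaxQuant output. This is a sketch under stated assumptions: the column names follow the phosphosite code in this phase, the probabilities are invented toy values, and `label_confident_sites` is a hypothetical helper name. Sites above the 0.75 cut-off are commonly called class I sites.

```python
import pandas as pd

# Toy table using the column names from the phosphosite code above;
# the probabilities are made up for illustration
toy_sites = pd.DataFrame({
    "Gene names": ["AKT1", "MAPK1", "TP53"],
    "Amino acid": ["S", "Y", "S"],
    "Position": [473, 187, 15],
    "Localization prob": [0.99, 0.62, 0.80],
})


def label_confident_sites(sites, min_prob=0.75):
    """Keep sites above the localization cut-off and build GENE_AApos labels."""
    confident = sites[sites["Localization prob"] > min_prob].copy()
    confident["site"] = (
        confident["Gene names"] + "_"
        + confident["Amino acid"]
        + confident["Position"].astype(str)
    )
    return confident


labelled = label_confident_sites(toy_sites)
```

Here the MAPK1 site is dropped because its localization probability falls below the threshold, leaving two labelled sites.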
Phase 5: Functional Enrichment
Objective: Interpret biological meaning of protein changes via pathway analysis.
Pathway and Gene Ontology enrichment:
```python
def pathway_enrichment_proteins(de_proteins, organism='human',
                                library='KEGG_2021_Human'):
    """
    Perform pathway enrichment for differentially expressed proteins.
    Uses ToolUniverse gene-enrichment skill.
    de_proteins: table with 'gene_name' and boolean 'significant' columns.
    library: Enrichr gene-set library name (e.g. a GO or KEGG library).
    organism: currently unused; the default library is human-specific.
    """
    from tooluniverse import ToolUniverse
    tu = ToolUniverse()
    # Extract gene names for significant proteins
    sig_proteins = de_proteins[de_proteins['significant']]
    gene_list = sig_proteins['gene_name'].dropna().tolist()
    # Run enrichment via ToolUniverse
    enrichment = tu.run_one_function({
        "name": "enrichr_enrich",
        "arguments": {
            "gene_list": ",".join(gene_list),
            "library": library
        }
    })
    return enrichment
```
Protein complex enrichment:
```python
def protein_complex_enrichment(protein_list):
    """
    Test for enrichment of known protein complexes (CORUM database).
    """
    # Query CORUM or use ToolUniverse
    # Identify if proteins are part of known complexes
    pass
```
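The complex enrichment step is left as a stub above. One common way to fill it is an over-representation test based on the hypergeometric distribution, sketched below with the standard library only. The function names, the `complexes` dictionary (assumed to be loaded from a CORUM download) and the toy identifiers are all illustrative assumptions.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_selected, n_overlap):
    """P(X >= n_overlap) when drawing n_selected proteins without replacement."""
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def complex_enrichment(hits, complexes, background):
    """Test each complex for over-representation among the hit proteins."""
    background = set(background)
    hits = set(hits) & background
    results = []
    for name, members in complexes.items():
        members = set(members) & background
        overlap = hits & members
        if not overlap:
            continue
        p = hypergeom_enrichment_p(
            len(background), len(members), len(hits), len(overlap)
        )
        results.append({"complex": name, "overlap": sorted(overlap), "p_value": p})
    return sorted(results, key=lambda r: r["p_value"])


# Toy data: 20 background proteins, one made-up 4-member complex, 5 hits
background = [f"P{i}" for i in range(20)]
complexes = {"toy_complex": ["P0", "P1", "P2", "P3"]}
hits = ["P0", "P1", "P2", "P10", "P11"]
enriched = complex_enrichment(hits, complexes, background)
```

The background should be the set of proteins actually quantified in the experiment rather than the whole proteome, otherwise abundant complexes look enriched by default. P-values across many complexes still need multiple testing correction.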
Phase 6: Protein-Protein Interactions
Objective: Identify interaction networks and protein complexes.
STRING network analysis:
```python
import networkx as nx

def build_protein_network(protein_list, confidence=0.7):
    """
    Build PPI network using STRING database.
    Uses ToolUniverse STRING tools.
    """
    from tooluniverse import ToolUniverse
    tu = ToolUniverse()
    # Get interactions
    interactions = tu.run_one_function({
        "name": "string_get_interactions",
        "arguments": {
            "proteins": ",".join(protein_list),
            "species": 9606,  # human
            "score_threshold": int(confidence * 1000)
        }
    })
    # Build network graph
    G = nx.Graph()
    for interaction in interactions['data']:
        G.add_edge(
            interaction['protein1'],
            interaction['protein2'],
            score=interaction['score']
        )
    return G
```
Module detection:
```python
import pandas as pd
from networkx.algorithms import community

def detect_protein_modules(network_graph):
    """
    Identify tightly connected protein modules (complexes).
    """
    # Detect communities
    communities = community.greedy_modularity_communities(network_graph)
    # Annotate modules with enriched functions
    modules = []
    for i, comm in enumerate(communities):
        module_proteins = list(comm)
        # pathway_enrichment_proteins expects a table with 'gene_name' and
        # 'significant' columns, so wrap the module members accordingly
        module_table = pd.DataFrame(
            {'gene_name': module_proteins, 'significant': True}
        )
        # Run enrichment for this module
        enrichment = pathway_enrichment_proteins(module_table)
        modules.append({
            'module_id': i,
            'proteins': module_proteins,
            'size': len(module_proteins),
            # Assumes the enrichment result exposes a 'top_terms' list
            'top_function': enrichment['top_terms'][0]
        })
    return modules
```
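The workflow also lists hub protein identification, which has no code in this phase. A minimal sketch ranks proteins by degree, the number of distinct interaction partners. It uses only the standard library; `find_hub_proteins` and the toy edge list are illustrative assumptions, and other centrality measures may suit a given network better.

```python
from collections import defaultdict


def find_hub_proteins(edges, top_n=3):
    """Rank proteins by degree (number of distinct interaction partners)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Highest degree first; ties are broken alphabetically
    ranked = sorted(neighbours.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(protein, len(partners)) for protein, partners in ranked[:top_n]]


# Toy network with five proteins
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
hubs = find_hub_proteins(toy_edges, top_n=2)
```

For a networkx graph `G` such as the one built above, the same function can be called as `find_hub_proteins(G.edges())`.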
Phase 7: Multi-Omics Integration
Objective: Integrate proteomics with transcriptomics and other omics.
Protein-RNA correlation:
```python
from scipy.stats import spearmanr

def correlate_protein_rna(protein_data, rna_data, common_samples):
    """
    Correlate protein and mRNA levels for each gene.
    Expected: r ~ 0.4-0.6 (moderate correlation)
    Discordance indicates post-transcriptional regulation
    Both tables should hold centered values (e.g. log2 fold changes
    versus control), because the classification uses the sign of the mean.
    """
    # Find common genes
    common_genes = set(protein_data.index) & set(rna_data.index)
    correlations = {}
    for gene in common_genes:
        protein = protein_data.loc[gene, common_samples]
        rna = rna_data.loc[gene, common_samples]
        r, p = spearmanr(protein, rna)
        correlations[gene] = {
            'r': r,
            'p': p,
            'regulation': classify_regulation(r, protein.mean(), rna.mean())
        }
    return correlations

def classify_regulation(r, protein_level, rna_level):
    """
    Classify regulatory mechanism based on correlation and levels.
    Levels are interpreted relative to 0 (positive = up, negative = down).
    """
    if r > 0.6 and protein_level > 0 and rna_level > 0:
        return 'transcriptional_upregulation'
    elif r > 0.6 and protein_level < 0 and rna_level < 0:
        return 'transcriptional_downregulation'
    elif r < 0.2 and protein_level > 0 and rna_level < 0:
        return 'translational_upregulation'
    elif r < 0.2 and protein_level < 0 and rna_level > 0:
        return 'protein_degradation'
    else:
        return 'mixed_regulation'
```
Integration with multi-omics skill:
```python
def integrate_with_multiomics(protein_data, rna_data, methylation_data):
    """
    Pass proteomics data to multi-omics integration skill.
    Enables comprehensive analysis across all molecular layers.
    """
    # Prepare for multi-omics skill
    omics_data = {
        'proteomics': protein_data,
        'rnaseq': rna_data,
        'methylation': methylation_data
    }
    # Invoke multi-omics integration skill
    # (Would use Skill tool to invoke tooluniverse-multi-omics-integration)
    # The invocation is not shown here, so return the prepared inputs
    return omics_data
```
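For many genes, the per-gene correlation can also be computed in one vectorized step. The sketch below relies on the fact that Spearman's correlation is Pearson's correlation of the ranks. It uses pandas only; `spearman_by_gene` and the toy tables are assumptions made for illustration, and it returns only the coefficient, not a p-value.

```python
import pandas as pd


def spearman_by_gene(protein_data, rna_data):
    """Spearman correlation per gene across the samples both tables share."""
    common_genes = protein_data.index.intersection(rna_data.index)
    common_samples = protein_data.columns.intersection(rna_data.columns)
    prot = protein_data.loc[common_genes, common_samples]
    rna = rna_data.loc[common_genes, common_samples]
    # Rank within each gene (row), then take the row-wise Pearson correlation
    return prot.rank(axis=1).corrwith(rna.rank(axis=1), axis=1)


samples = ["s1", "s2", "s3", "s4"]
protein = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]],
    index=["G1", "G2"], columns=samples,
)
rna = pd.DataFrame(
    [[10.0, 20.0, 30.0, 40.0], [1.0, 2.0, 3.0, 4.0], [5.0, 5.5, 6.0, 7.0]],
    index=["G1", "G2", "G3"], columns=samples,
)
rho = spearman_by_gene(protein, rna)
```

In this toy example G1 is perfectly concordant and G2 perfectly discordant, while G3 is skipped because it has no protein measurement.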
Phase 8: Report Generation
Generate a comprehensive proteomics report following this template (example values):
```markdown
# Proteomics Analysis Report

## Dataset Summary
- Samples: 20 (10 disease, 10 control)
- Proteins Identified: 5,432
- Proteins Quantified: 4,987 (detected in at least 3 samples)
- Platform: Orbitrap Fusion Lumos, MaxQuant 2.0

## Quality Control
- Missing Values: 15% average per protein
- Sample Correlation: 0.92-0.98 within groups
- PCA: Clear separation between disease and control (PC1: 35% variance)

## Differential Expression
- Significant Proteins: 432 (adj. p < 0.05, |log2FC| > 1)
- Upregulated: 245 proteins
- Downregulated: 187 proteins
- Top upregulated: MYC (log2FC=3.2), EGFR (log2FC=2.8)
- Top downregulated: TP53 (log2FC=-2.5), BRCA1 (log2FC=-2.1)

## Phosphoproteomics
- Phosphosites Quantified: 8,543
- Differentially Phosphorylated: 234 sites (p < 0.05)
- Top Predicted Kinases: CDK1, MAPK1, AKT1

## Pathway Enrichment
### Top Pathways (Upregulated)
- Cell Cycle (p=1e-15) - 45 proteins, including cyclins, CDKs
- DNA Replication (p=1e-12) - 23 proteins
- Glycolysis (p=1e-10) - 18 proteins

### Top Pathways (Downregulated)
- Apoptosis (p=1e-14) - 32 proteins, including caspases
- DNA Repair (p=1e-11) - 28 proteins
- Oxidative Phosphorylation (p=1e-9) - 25 proteins

## Protein Network Analysis
- Network: 432 nodes, 1,245 edges (STRING confidence > 0.7)
- Modules Detected: 8 functional modules
  - Module 1: Cell cycle (85 proteins)
  - Module 2: Metabolism (62 proteins)
  - Module 3: Translation (48 proteins)

## Protein-RNA Correlation
- Overall Correlation: r = 0.54 (moderate, expected)
- High Correlation: 2,134 genes (r > 0.6) - transcriptional regulation
- Low Correlation: 456 genes (r < 0.2) - post-transcriptional regulation
- Translation-Regulated: 89 proteins (high protein, low RNA)

## Biological Interpretation
Disease state shows increased proliferation (MYC, cyclins) with concurrent
suppression of apoptosis and DNA repair (TP53, BRCA1). Metabolic shift toward
glycolysis evident at protein level. Post-transcriptional upregulation of
translation machinery suggests adaptation to proliferative demands.

## Potential Biomarkers
Top 10 proteins for disease classification (Random Forest AUC=0.95); first five shown:
- MYC (protein)
- EGFR (protein)
- CDK1 (phospho-T161)
- TP53 (protein)
- BRCA1 (protein)
```
---
Integration with ToolUniverse
Skills Coordinated:
| Skill | Used For | Phase |
|---|---|---|
| | Pathway enrichment | Phase 5 |
| | PPI networks | Phase 6 |
| | RNA-seq for integration | Phase 7 |
| tooluniverse-multi-omics-integration | Cross-omics analysis | Phase 7 |
| | Protein annotation | Phase 8 |
Example Use Cases
Use Case 1: Cancer Proteomics
Question: "Analyze proteomics data from breast cancer vs normal tissue"
Workflow:
- Load MaxQuant proteinGroups.txt
- QC and filter (keep proteins with 2+ peptides, detected in 3+ samples)
- Impute missing values, normalize by median
- Differential expression (limma): 432 significant proteins
- Pathway enrichment: Cell cycle, metabolism upregulated
- STRING network: Identify hub proteins (MYC, EGFR)
- Integrate with TCGA RNA-seq: Find translation-regulated genes
- Report: Comprehensive analysis with biomarkers
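The loading, filtering, and median-normalization steps of this workflow could look roughly like the sketch below. It assumes the input follows MaxQuant's usual proteinGroups.txt layout (columns `Protein IDs`, `Unique peptides`, `Reverse`, `Potential contaminant`, `Only identified by site`, and `LFQ intensity <sample>`). The function name and defaults are hypothetical, and imputation and limma testing are left out.

```python
import numpy as np
import pandas as pd


def filter_and_normalize(protein_groups, min_unique_peptides=2, min_samples=3):
    """Filter a MaxQuant-style proteinGroups table and median-normalize it.

    Returns log2 LFQ intensities, one row per retained protein group.
    """
    df = protein_groups.copy()

    # Drop decoys, contaminants and site-only identifications ('+' marks a hit).
    for flag in ("Reverse", "Potential contaminant", "Only identified by site"):
        if flag in df.columns:
            df = df[df[flag] != "+"]

    # Keep protein groups supported by enough unique peptides.
    df = df[df["Unique peptides"] >= min_unique_peptides]

    # Missing LFQ values are reported as 0; treat them as NaN before log2.
    lfq_cols = [c for c in df.columns if c.startswith("LFQ intensity ")]
    lfq = df.set_index("Protein IDs")[lfq_cols].astype(float).replace(0.0, np.nan)
    log_lfq = np.log2(lfq)

    # Require detection in at least `min_samples` samples.
    log_lfq = log_lfq[log_lfq.notna().sum(axis=1) >= min_samples]

    # Median normalization: subtract each sample's median so medians align at 0.
    return log_lfq - log_lfq.median(axis=0)


# Usage with a real file (path is a placeholder):
# protein_groups = pd.read_csv("proteinGroups.txt", sep="\t", low_memory=False)
# normalized = filter_and_normalize(protein_groups)
```

Subtracting per-sample medians on the log2 scale is the simplest form of median normalization; it assumes most proteins do not change between samples.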
Use Case 2: Phosphoproteomics Signaling
Question: "What kinase signaling is activated in response to drug treatment?"
Workflow:
- Load Phospho (STY)Sites.txt from MaxQuant
- Filter by localization probability > 0.75
- Differential phosphorylation analysis
- Kinase prediction for significant sites
- Identify MAPK1, CDK1, AKT1 as top kinases
- Pathway enrichment: MAPK, PI3K/AKT pathways
- Report: Drug activates growth signaling
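The first steps of this workflow (localization filter, then a per-site fold change) can be sketched as follows. The `Localization prob` column name follows MaxQuant's site tables; the function names and the intensity column names passed in are assumptions for illustration, and kinase prediction is not covered.

```python
import numpy as np
import pandas as pd


def filter_phosphosites(sites, min_localization_prob=0.75):
    """Keep confidently localized phosphosites (probability above the cutoff)."""
    keep = sites["Localization prob"] > min_localization_prob
    return sites[keep].copy()


def site_log2_fold_change(sites, treated_cols, control_cols):
    """Mean log2(treated) minus mean log2(control) per site.

    Zero intensities are treated as missing and skipped when averaging.
    """
    treated = np.log2(sites[treated_cols].astype(float).replace(0.0, np.nan))
    control = np.log2(sites[control_cols].astype(float).replace(0.0, np.nan))
    return treated.mean(axis=1) - control.mean(axis=1)
```

A positive value means the site is more phosphorylated after treatment. A statistical test with multiple testing correction is still needed before calling a site significant.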
Use Case 3: Protein-RNA Integration
Question: "Which proteins are regulated post-transcriptionally?"
Workflow:
- Load proteomics (MaxQuant) and RNA-seq (DESeq2) data
- Match samples, extract common genes
- Correlate protein and RNA for each gene
- Identify low-correlation genes (r < 0.2)
- Classify: translation upregulation, protein degradation
- Enrichment: Find pathways enriched in post-transcriptional regulation
- Report: 89 translation-regulated proteins, RNA-binding proteins enriched
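The matching and correlation steps can be vectorized instead of looping gene by gene. The sketch below assumes two gene-by-sample tables with no missing values and computes Spearman's rho as the Pearson correlation of within-gene ranks; the function names are hypothetical.

```python
import numpy as np
import pandas as pd


def rowwise_spearman(protein, rna):
    """Spearman correlation between matching rows of two gene x sample tables."""
    genes = protein.index.intersection(rna.index)
    samples = protein.columns.intersection(rna.columns)

    # Spearman's rho is the Pearson correlation of the within-gene ranks.
    p_rank = protein.loc[genes, samples].rank(axis=1)
    r_rank = rna.loc[genes, samples].rank(axis=1)

    p_centered = p_rank.sub(p_rank.mean(axis=1), axis=0)
    r_centered = r_rank.sub(r_rank.mean(axis=1), axis=0)
    numerator = (p_centered * r_centered).sum(axis=1)
    denominator = np.sqrt(
        (p_centered ** 2).sum(axis=1) * (r_centered ** 2).sum(axis=1)
    )
    return numerator / denominator


def low_correlation_genes(protein, rna, threshold=0.2):
    """Genes whose protein-RNA correlation falls below the threshold."""
    rho = rowwise_spearman(protein, rna)
    return rho[rho < threshold].sort_values()
```

Genes with constant values in either table get a NaN correlation and are excluded from the low-correlation set rather than counted as discordant.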
Quantified Minimums
| Component | Requirement |
|---|---|
| Proteins quantified | At least 500 proteins |
| Replicates | At least 3 per condition |
| Filtering | 2+ unique peptides per protein |
| Statistical test | limma or t-test with multiple testing correction |
| Pathway enrichment | At least one method (GO, KEGG, or Reactome) |
| Report | Summary, QC, DE results, pathways, visualizations |
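The table does not name a specific multiple testing correction; Benjamini-Hochberg is a common choice and is assumed here. A minimal NumPy sketch, with a hypothetical function name, that adjusts a vector of per-protein p-values:

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)

    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)

    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted
```

Proteins whose adjusted p-value falls below the chosen cutoff (0.05 in the report template) are then reported as significant, usually alongside a fold-change threshold.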
Limitations
- Platform-specific: Optimized for MS-based proteomics (not Western blot quantification)
- Missing values: High missing rate (>50% per protein) limits statistical power
- PTM analysis: Requires enrichment protocols for comprehensive PTM profiling
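- Absolute quantification: Relative abundance only; label-free, TMT, and SILAC workflows all yield relative quantities, and absolute quantification requires spiked-in standards of known amount (e.g. isotope-labeled AQUA peptides)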
- Protein isoforms: Typically collapsed to gene level
- Dynamic range: MS has limited dynamic range vs mRNA sequencing
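The missing-value limitation can be checked before any statistics are run. The sketch below assumes a protein-by-sample intensity table in which missing measurements are already NaN; the function names are hypothetical and the 50% cutoff mirrors the figure quoted in the list above.

```python
import numpy as np
import pandas as pd


def missing_rate_per_protein(intensities):
    """Fraction of samples with a missing value, per protein (row)."""
    return intensities.isna().mean(axis=1)


def split_by_missingness(intensities, max_missing=0.5):
    """Separate proteins that are usable from those with too many gaps."""
    rate = missing_rate_per_protein(intensities)
    usable = intensities[rate <= max_missing]
    flagged = intensities[rate > max_missing]
    return usable, flagged
```

Flagged proteins are better reported separately than imputed, since most of their values would be imputed rather than measured.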
References
Methods:
- MaxQuant: https://doi.org/10.1038/nbt.1511
- Limma for proteomics: https://doi.org/10.1093/nar/gkv007
- DEP workflow: https://doi.org/10.1038/nprot.2018.107
Databases:
- STRING: https://string-db.org
- PhosphoSitePlus: https://www.phosphosite.org
- CORUM: https://mips.helmholtz-muenchen.de/corum