GEO Database
Overview

The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

When to Use This Skill

This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

Core Capabilities

1. Understanding GEO Data Organization
GEO organizes data hierarchically using different accession types:

**Series (GSE):** A complete experiment with a set of related samples
- Example: GSE123456
- Contains experimental design, samples, and overall study information
- Largest organizational unit in GEO
- Current count: 264,928+ series

**Sample (GSM):** A single experimental sample or biological replicate
- Example: GSM987654
- Contains individual sample data, protocols, and metadata
- Linked to platforms and series
- Current count: 8,068,632+ samples

**Platform (GPL):** The microarray or sequencing platform used
- Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
- Describes the technology and probe/feature annotations
- Shared across multiple experiments
- Current count: 27,739+ platforms

**DataSet (GDS):** Curated collections with consistent formatting
- Example: GDS5678
- Experimentally comparable samples organized by study design
- Processed for differential analysis
- Subset of GEO data (4,348 curated datasets)
- Ideal for quick comparative analyses

**Profiles:** Gene-specific expression data linked to sequence features
- Queryable by gene name or annotation
- Cross-references to Entrez Gene
- Enables gene-centric searches across all studies
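Each accession type can be recognized from its prefix, which is handy for dispatching download or parsing logic. A minimal sketch (the helper name and mapping below are illustrative, not part of any GEO client library):

```python
import re

# Map GEO accession prefixes to the record types described above
GEO_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

def classify_geo_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)(\d+)", accession.strip().upper())
    if not match:
        return None
    return GEO_TYPES[match.group(1)]

print(classify_geo_accession("GSE123456"))  # Series
print(classify_geo_accession("gpl570"))     # Platform
```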
2. Searching GEO Data
**GEO DataSets Search:**

Search for studies by keywords, organism, or experimental conditions:

```python
from Bio import Entrez

# Configure Entrez (required)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")
```
**GEO Profiles Search:**

Find gene-specific expression patterns:

```python
# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")
```

**Advanced Search Patterns:**

```python
# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find recent high-throughput studies
search_terms = [
    "RNA-seq[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
search_terms = [
    "Smith[Author]",
    "diabetes[Disease]"
]
results = advanced_geo_search(search_terms)
```
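The bracketed field tags can also be generated from plain (value, field) pairs, which avoids hand-writing brackets in longer queries. A small sketch (`build_geo_query` is an illustrative helper, not part of Biopython or GEOparse):

```python
def build_geo_query(pairs, operator="AND"):
    """Build an Entrez query string from (value, field) pairs.

    Each pair becomes 'value[Field]'; a pair with field=None is kept
    as a free-text term.
    """
    terms = [f"{value}[{field}]" if field else str(value) for value, field in pairs]
    return f" {operator} ".join(terms)

query = build_geo_query([
    ("RNA-seq", "DataSet Type"),
    ("Homo sapiens", "Organism"),
    ("2024", "Publication Date"),
])
print(query)
# RNA-seq[DataSet Type] AND Homo sapiens[Organism] AND 2024[Publication Date]
```

The resulting string can be passed straight to `search_geo_datasets` or `Entrez.esearch`.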
3. Retrieving GEO Data with GEOparse (Recommended)
GEOparse is the primary Python library for accessing GEO data:

**Installation:**

```bash
uv pip install GEOparse
```

**Basic Usage:**

```python
import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
```
**Working with Expression Data:**

```python
import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: From series matrix file (fastest)
if hasattr(gse, 'pivot_samples'):
    expression_df = gse.pivot_samples('VALUE')
    print(expression_df.shape)  # genes x samples

# Method 2: From individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        expression_data[gsm_name] = gsm.table['VALUE']
expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")
```
**Accessing Supplementary Files:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'supplementary_files'):
        print(f"Sample {gsm_name}:")
        for file_url in gsm.metadata.get('supplementary_file', []):
            print(f"  {file_url}")
```
**Filtering and Subsetting Data:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]
treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]
print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
```
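Probe-level rows in the expression matrix can be mapped to gene symbols by joining against the platform table (`gpl.table` in GEOparse). The annotation column name varies by platform ('Gene Symbol', 'gene_assignment', etc.), so inspect `gpl.table.columns` first. A sketch with the join logic shown on toy data in place of a real download:

```python
import pandas as pd

def annotate_probes(expression_df, platform_table, symbol_col="Gene Symbol"):
    """Attach gene symbols to a probe-indexed expression matrix.

    platform_table is expected to have an 'ID' column matching the
    expression index; symbol_col depends on the platform.
    """
    annotation = platform_table.set_index("ID")[symbol_col]
    return expression_df.join(annotation, how="left")

# Toy stand-ins for gse.pivot_samples('VALUE') and gpl.table
expr = pd.DataFrame(
    {"GSM1": [5.2, 7.1], "GSM2": [5.0, 6.8]},
    index=pd.Index(["1007_s_at", "1053_at"], name="ID_REF"),
)
gpl_table = pd.DataFrame({
    "ID": ["1007_s_at", "1053_at"],
    "Gene Symbol": ["DDR1", "RFC2"],
})
print(annotate_probes(expr, gpl_table))
```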
4. Using NCBI E-utilities for GEO Access
E-utilities provide lower-level programmatic access to GEO metadata:
**Basic E-utilities Workflow:**

```python
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch full GEO records"""
    ids = ",".join(id_list)
    handle = Entrez.efetch(db=db, id=ids, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]
summaries = fetch_geo_summaries(id_list)
for summary in summaries:
    print(f"GDS: {summary.get('Accession', 'N/A')}")
    print(f"Title: {summary.get('title', 'N/A')}")
    print(f"Samples: {summary.get('n_samples', 'N/A')}")
    print()
```
**Batch Processing with E-utilities:**

```python
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

def batch_fetch_geo_metadata(accessions, batch_size=100):
    """Fetch metadata for multiple GEO accessions"""
    results = {}
    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]
        # Search for each accession
        for accession in batch:
            try:
                query = f"{accession}[Accession]"
                search_handle = Entrez.esearch(db="gds", term=query)
                search_results = Entrez.read(search_handle)
                search_handle.close()
                if search_results['IdList']:
                    # Fetch summary
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]
                # Be polite to NCBI servers
                time.sleep(0.34)  # Max 3 requests per second
            except Exception as e:
                print(f"Error fetching {accession}: {e}")
    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
```
5. Direct FTP Access for Data Files
**FTP URLs for GEO Data:**

GEO data can be downloaded directly via FTP:

```python
import ftplib
import os

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO files via FTP"""
    # Construct FTP path based on accession type
    if accession.startswith("GSE"):
        # Series files live under GSE###nnn/GSE######/<file_type>/
        gse_num = accession[3:]
        base_num = gse_num[:-3] + "nnn"
        ftp_path = f"/geo/series/GSE{base_num}/{accession}/"
        if file_type == "matrix":
            ftp_path += "matrix/"
            filename = f"{accession}_series_matrix.txt.gz"
        elif file_type == "soft":
            ftp_path += "soft/"
            filename = f"{accession}_family.soft.gz"
        elif file_type == "miniml":
            ftp_path += "miniml/"
            filename = f"{accession}_family.xml.tgz"

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)
    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)
    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")
```
**Using wget or curl for Downloads:**

```bash
# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz

# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/

# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
```
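The path convention in these URLs (the accession's last three digits replaced by `nnn`) can be factored into a small pure helper, so the same logic serves both Python and shell downloads. A sketch (`geo_series_url` is an illustrative name; the layout follows the wget URLs above):

```python
def geo_series_url(accession, file_type="matrix"):
    """Build the FTP URL for a GEO series file.

    Follows the /geo/series/GSE###nnn/GSE######/ directory layout.
    """
    stem = accession[:-3] + "nnn"  # GSE123456 -> GSE123nnn
    base = f"ftp://ftp.ncbi.nlm.nih.gov/geo/series/{stem}/{accession}"
    if file_type == "matrix":
        return f"{base}/matrix/{accession}_series_matrix.txt.gz"
    if file_type == "soft":
        return f"{base}/soft/{accession}_family.soft.gz"
    if file_type == "suppl":
        return f"{base}/suppl/"
    raise ValueError(f"unknown file_type: {file_type}")

print(geo_series_url("GSE123456"))
# ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz
```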
6. Analyzing GEO Data
**Quality Control and Preprocessing:**

```python
import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log transformation (if needed)
if expression_df.min().min() > 0:  # Check if already log-transformed
    if expression_df.max().max() > 100:
        expression_df = np.log2(expression_df + 1)
        print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")
plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')
```
**Differential Expression Analysis:**

```python
import GEOparse
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
# (the mean difference is a log2 fold change only for log-scale values)
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]
    # Calculate statistics
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)
    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

# Create results DataFrame
de_results = pd.DataFrame(results)

# Multiple testing correction (Benjamini-Hochberg)
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'],
    method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (abs(de_results['log2_fold_change']) > 1)
]
print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)
```
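For large matrices a per-gene Python loop is slow; `scipy.stats.ttest_ind` accepts 2-D arrays and tests every row at once. A sketch on synthetic data standing in for real expression values (assumes log-scale values, as in the loop-based version):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic log2 expression: 100 genes, 3 control + 3 treatment samples
control = rng.normal(8.0, 1.0, size=(100, 3))
treatment = rng.normal(8.0, 1.0, size=(100, 3))
treatment[:10] += 2.0  # spike in 10 truly up-regulated genes

# One t-test per row, vectorized across all genes
t_stat, p_value = stats.ttest_ind(treatment, control, axis=1)
log2_fc = treatment.mean(axis=1) - control.mean(axis=1)

de = pd.DataFrame({"log2_fold_change": log2_fc, "p_value": p_value})
print((de["p_value"] < 0.05).sum(), "nominally significant genes")
```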
**Correlation and Clustering Analysis:**

```python
import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Sample correlation heatmap
sample_corr = expression_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')

# Hierarchical clustering
distances = pdist(expression_df.T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')
plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage, labels=expression_df.columns)
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')
```
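A PCA of samples is a common complement to correlation heatmaps and dendrograms for spotting batch effects or group structure. A sketch using scikit-learn on a synthetic genes-by-samples matrix in place of a real `expression_df`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic genes x samples matrix standing in for expression_df.values
n_genes, n_samples = 500, 6
X = rng.normal(8.0, 1.0, size=(n_genes, n_samples))
X[:, 3:] += rng.normal(1.0, 0.2, size=(n_genes, 1))  # shift samples 4-6

# PCA expects samples as rows, so transpose the genes x samples matrix
pca = PCA(n_components=2)
coords = pca.fit_transform(X.T)

print("Coordinates shape:", coords.shape)
print("Explained variance:", pca.explained_variance_ratio_)
```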
7. Batch Processing Multiple Datasets
**Download and Process Multiple Series:**

```python
import GEOparse
import pandas as pd
import os

def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series"""
    results = {}
    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)
            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
                'organism': gse.metadata.get('organism', ['N/A'])[0],
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }
            # Save expression data
            if hasattr(gse, 'pivot_samples'):
                expr_df = gse.pivot_samples('VALUE')
                expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv")
                results[gse_id]['num_genes'] = len(expr_df)
        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}
    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(f"{destdir}/batch_summary.csv")
    return results

# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)
```
**Meta-Analysis Across Studies:**

```python
import GEOparse
import pandas as pd
import numpy as np

def meta_analysis_geo(gse_list, gene_of_interest):
    """Perform meta-analysis of gene expression across studies"""
    results = []
    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")
            # Get platform annotation
            gpl = list(gse.gpls.values())[0]
            # Find gene in platform
            if hasattr(gpl, 'table'):
                gene_probes = gpl.table[
                    gpl.table['Gene Symbol'].str.contains(
                        gene_of_interest,
                        case=False,
                        na=False
                    )
                ]
                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')
                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]
                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })
        except Exception as e:
            print(f"Error in {gse_id}: {e}")
    return pd.DataFrame(results)

# Meta-analysis for TP53
gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
```
Installation and Setup
Python Libraries

```bash
# Primary GEO access library (recommended)
uv pip install GEOparse

# For E-utilities and programmatic NCBI access
uv pip install biopython

# For data analysis
uv pip install pandas numpy scipy

# For visualization
uv pip install matplotlib seaborn

# For statistical analysis
uv pip install statsmodels scikit-learn
```
Configuration
Set up NCBI E-utilities access:

```python
from Bio import Entrez

# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"

# Optional: Set API key for increased rate limits
# Get your API key from: https://www.ncbi.nlm.nih.gov/account/
Entrez.api_key = "your_api_key_here"

# With API key: 10 requests/second
# Without API key: 3 requests/second
```
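Those limits translate directly into a minimum interval between requests, which can be wrapped in a small throttle so batch scripts stay compliant whether or not an API key is set. A sketch (`EntrezThrottle` is an illustrative helper, not part of Biopython):

```python
import time

class EntrezThrottle:
    """Sleep just enough between calls to respect NCBI rate limits."""

    def __init__(self, api_key=None):
        # 10 requests/second with an API key, 3/second without
        self.interval = 1.0 / 10 if api_key else 1.0 / 3
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

throttle = EntrezThrottle(api_key=None)
print(f"Minimum interval: {throttle.interval:.3f} s")  # 0.333 s
# Call throttle.wait() before each Entrez request in a loop
```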
Common Use Cases
Transcriptomics Research
转录组学研究
- Download gene expression data for specific conditions
- Compare expression profiles across studies
- Identify differentially expressed genes
- Perform meta-analyses across multiple datasets
- 下载特定条件下的基因表达数据
- 跨研究比较表达谱
- 鉴定差异表达基因
- 跨多个数据集进行荟萃分析
Drug Response Studies
药物反应研究
- Analyze gene expression changes after drug treatment
- Identify biomarkers for drug response
- Compare drug effects across cell lines or patients
- Build predictive models for drug sensitivity
Disease Biology
- Study gene expression in disease vs. normal tissues
- Identify disease-associated expression signatures
- Compare patient subgroups and disease stages
- Correlate expression with clinical outcomes
Biomarker Discovery
- Screen for diagnostic or prognostic markers
- Validate biomarkers across independent cohorts
- Compare marker performance across platforms
- Integrate expression with clinical data
Key Concepts
**SOFT (Simple Omnibus Format in Text)**: GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse.
**MINiML (MIAME Notation in Markup Language)**: XML format for GEO data, used for programmatic access and data exchange.
**Series Matrix**: Tab-delimited expression matrix with samples as columns and genes/probes as rows. The fastest format for getting expression data.
**MIAME Compliance**: Minimum Information About a Microarray Experiment - standardized annotation that GEO enforces for all submissions.
**Expression Value Types**: Different types of expression measurements (raw signal, normalized, log-transformed). Always check the platform and processing methods.
**Platform Annotation**: Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.
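The probe-to-gene mapping behind platform annotation can be sketched with pandas. The probe IDs below are GPL570-style, but the tiny tables are made up for illustration; real ones come from the series matrix file and the platform (GPL) annotation table:

```python
import pandas as pd

# Toy series-matrix slice (probes x samples) and GPL annotation table
expr = pd.DataFrame(
    {"GSM1": [5.2, 7.1, 3.3], "GSM2": [5.0, 7.4, 3.1]},
    index=["1007_s_at", "1053_at", "117_at"],
)
annot = pd.DataFrame(
    {"ID": ["1007_s_at", "1053_at", "117_at"],
     "Gene Symbol": ["DDR1", "RFC2", "HSPA6"]},
).set_index("ID")

# Attach gene symbols, then average any probes mapping to the same gene
expr_genes = expr.join(annot).groupby("Gene Symbol").mean()
print(expr_genes)
```

Averaging is one common way to collapse multiple probes per gene; taking the probe with the highest mean intensity is another, depending on the analysis.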
GEO2R Web Tool
For quick analysis without coding, use GEO2R:
- Web-based statistical analysis tool integrated into GEO
- Accessible at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx
- Performs differential expression analysis
- Generates R scripts for reproducibility
- Useful for exploratory analysis before downloading data
Rate Limiting and Best Practices
NCBI E-utilities Rate Limits:
- Without API key: 3 requests per second
- With API key: 10 requests per second
- Implement delays between requests: `time.sleep(0.34)` (no API key) or `time.sleep(0.1)` (with API key)
FTP Access:
- No rate limits for FTP downloads
- Preferred method for bulk downloads
- Can download entire directories with wget -r
GEOparse Caching:
- GEOparse automatically caches downloaded files in destdir
- Subsequent calls use cached data
- Clean cache periodically to save disk space
Optimal Practices:
- Use GEOparse for series-level access (easiest)
- Use E-utilities for metadata searching and batch queries
- Use FTP for direct file downloads and bulk operations
- Cache data locally to avoid repeated downloads
- Always set Entrez.email when using Biopython
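The delay guidance above can be wrapped in a small helper so every request path honors the same budget. A sketch; the `Throttle` class is illustrative, not an NCBI or Biopython API:

```python
import time

class Throttle:
    """Illustrative helper enforcing the E-utilities budget:
    3 requests/second without an API key, 10 with one."""

    def __init__(self, api_key=None):
        self.delay = 0.1 if api_key else 0.34
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the per-request delay."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()

throttle = Throttle(api_key=None)  # pass your NCBI API key if you have one
for accession in ["GSE100001", "GSE100002"]:
    throttle.wait()
    # place the Entrez.esearch / Entrez.efetch call for `accession` here
```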
Resources
references/geo_reference.md
Comprehensive reference documentation covering:
- Detailed E-utilities API specifications and endpoints
- Complete SOFT and MINiML file format documentation
- Advanced GEOparse usage patterns and examples
- FTP directory structure and file naming conventions
- Data processing pipelines and normalization methods
- Troubleshooting common issues and error handling
- Platform-specific considerations and quirks
Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.
Important Notes
Data Quality Considerations
- GEO accepts user-submitted data with varying quality standards
- Always check platform annotation and processing methods
- Verify sample metadata and experimental design
- Be cautious with batch effects across studies
- Consider reprocessing raw data for consistency
File Size Warnings
- Series matrix files can be large (>1 GB for large studies)
- Supplementary files (e.g., CEL files) can be very large
- Plan for adequate disk space before downloading
- Consider downloading samples incrementally
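Resolving a series' FTP directory before downloading makes it easy to inspect file sizes first. GEO groups series under a stub that replaces the accession's last three digits with `nnn`; the helper below is an illustrative sketch of that convention:

```python
def geo_series_ftp_dir(accession):
    """Directory for a series on the GEO FTP site. Assumes the layout
    /geo/series/<stub>/<accession>/, where the stub replaces the
    accession's last three digits with 'nnn'."""
    prefix, digits = accession[:3], accession[3:]
    stub = prefix + (digits[:-3] if len(digits) > 3 else "") + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/"

print(geo_series_ftp_dir("GSE100001"))
# https://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100001/
```

The returned directory typically contains `matrix/`, `soft/`, and `suppl/` subdirectories; listing `suppl/` first shows supplementary file sizes before committing to a download.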
Data Usage and Citation
- GEO data is freely available for research use
- Always cite original studies when using GEO data
- Cite GEO database: Barrett et al. (2013) Nucleic Acids Research
- Check individual dataset usage restrictions (if any)
- Follow NCBI guidelines for programmatic access
Common Pitfalls
- Different platforms use different probe IDs (requires annotation mapping)
- Expression values may be raw, normalized, or log-transformed (check metadata)
- Sample metadata can be inconsistently formatted across studies
- Not all series have series matrix files (older submissions)
- Platform annotations may be outdated (genes renamed, IDs deprecated)
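The "raw vs. log-transformed" pitfall can be guarded against with a quantile heuristic similar to GEO2R's auto-detect. This re-implementation is a sketch, with thresholds borrowed from GEO2R's generated R scripts rather than a formal standard:

```python
import numpy as np

def needs_log2(values):
    """Quantile heuristic for whether a matrix still holds linear-scale
    intensities (thresholds modeled on GEO2R's auto-detect)."""
    q = np.nanpercentile(values, [0, 25, 50, 75, 99, 100])
    return bool(q[4] > 100 or (q[5] - q[0] > 50 and q[1] > 0))

raw = np.array([[12000.0, 300.0], [55.0, 8000.0]])
expr = np.log2(raw + 1) if needs_log2(raw) else raw
```

Adding 1 before the log avoids `-inf` for zero intensities; already-logged matrices (values typically in the 0-16 range) pass through unchanged.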
Additional Resources
- GEO Website: https://www.ncbi.nlm.nih.gov/geo/
- GEO Submission Guidelines: https://www.ncbi.nlm.nih.gov/geo/info/submission.html
- GEOparse Documentation: https://geoparse.readthedocs.io/
- E-utilities Documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- GEO FTP Site: ftp://ftp.ncbi.nlm.nih.gov/geo/
- GEO2R Tool: https://www.ncbi.nlm.nih.gov/geo/geo2r/
- NCBI API Keys: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
- Biopython Tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html