
GEO Database


Overview


The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

When to Use This Skill


This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

Core Capabilities


1. Understanding GEO Data Organization


GEO organizes data hierarchically using different accession types:

**Series (GSE):** A complete experiment with a set of related samples
  • Example: GSE123456
  • Contains experimental design, samples, and overall study information
  • Largest organizational unit in GEO
  • Current count: 264,928+ series

**Sample (GSM):** A single experimental sample or biological replicate
  • Example: GSM987654
  • Contains individual sample data, protocols, and metadata
  • Linked to platforms and series
  • Current count: 8,068,632+ samples

**Platform (GPL):** The microarray or sequencing platform used
  • Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
  • Describes the technology and probe/feature annotations
  • Shared across multiple experiments
  • Current count: 27,739+ platforms

**DataSet (GDS):** Curated collections with consistent formatting
  • Example: GDS5678
  • Experimentally comparable samples organized by study design
  • Processed for differential analysis
  • Subset of GEO data (4,348 curated datasets)
  • Ideal for quick comparative analyses

**Profiles:** Gene-specific expression data linked to sequence features
  • Queryable by gene name or annotation
  • Cross-referenced to Entrez Gene
  • Enables gene-centric searches across all studies
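The accession prefixes above follow a strict pattern, so record types can be told apart mechanically. A minimal sketch (the `classify_accession` helper is hypothetical, not part of any GEO library):

```python
import re

# Prefix-to-type mapping for the accession types described above
ACCESSION_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

def classify_accession(accession):
    """Return the GEO record type for an accession like 'GSE123456'."""
    match = re.match(r"^(GSE|GSM|GPL|GDS)(\d+)$", accession)
    if not match:
        raise ValueError(f"Not a recognized GEO accession: {accession}")
    return ACCESSION_TYPES[match.group(1)]

print(classify_accession("GSE123456"))  # Series
print(classify_accession("GPL570"))     # Platform
```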

2. Searching GEO Data


GEO DataSets Search:

Search for studies by keywords, organism, or experimental conditions:

```python
from Bio import Entrez

# Configure Entrez (required)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search the GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")
```

**GEO Profiles Search:**

Find gene-specific expression patterns:

```python
# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")
```

**Advanced Search Patterns:**

```python
# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find recent high-throughput studies
search_terms = [
    "RNA-seq[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
search_terms = [
    "Smith[Author]",
    "diabetes[Disease]"
]
results = advanced_geo_search(search_terms)
```

3. Retrieving GEO Data with GEOparse (Recommended)


GEOparse is the primary Python library for accessing GEO data.

Installation:

```bash
uv pip install GEOparse
```

Basic Usage:

```python
import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
```

**Working with Expression Data:**

```python
import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Method 1: From series matrix file (fastest)
if hasattr(gse, 'pivot_samples'):
    expression_df = gse.pivot_samples('VALUE')
    print(expression_df.shape)  # genes x samples

# Method 2: From individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        expression_data[gsm_name] = gsm.table['VALUE']

expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")
```
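Method 2 assembles the matrix one sample column at a time. The same pattern can be exercised offline on small stand-in tables (synthetic data, not a real GEO series); indexing each per-sample table by its probe ID keeps the columns aligned on the same features:

```python
import pandas as pd

# Stand-ins for gsm.table: per-sample tables keyed by probe ID
sample_tables = {
    "GSM1": pd.DataFrame({"ID_REF": ["p1", "p2"], "VALUE": [5.2, 7.1]}),
    "GSM2": pd.DataFrame({"ID_REF": ["p1", "p2"], "VALUE": [5.5, 6.9]}),
}

# Index each table by probe ID so columns align on the same probes
expression_data = {
    name: table.set_index("ID_REF")["VALUE"]
    for name, table in sample_tables.items()
}
expression_df = pd.DataFrame(expression_data)
print(expression_df.shape)  # (2, 2): probes x samples
```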

**Accessing Supplementary Files:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample {gsm_name}:")
    for file_url in gsm.metadata.get('supplementary_file', []):
        print(f"  {file_url}")
```

**Filtering and Subsetting Data:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]
treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]
print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
```

4. Using NCBI E-utilities for GEO Access


E-utilities provide lower-level programmatic access to GEO metadata.

Basic E-utilities Workflow:

```python
from Bio import Entrez

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch full GEO records"""
    ids = ",".join(id_list)
    handle = Entrez.efetch(db=db, id=ids, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]

summaries = fetch_geo_summaries(id_list)
for summary in summaries:
    print(f"GDS: {summary.get('Accession', 'N/A')}")
    print(f"Title: {summary.get('title', 'N/A')}")
    print(f"Samples: {summary.get('n_samples', 'N/A')}")
    print()
```

**Batch Processing with E-utilities:**

```python
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

def batch_fetch_geo_metadata(accessions, batch_size=100):
    """Fetch metadata for multiple GEO accessions"""
    results = {}

    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]

        # Search for each accession
        for accession in batch:
            try:
                query = f"{accession}[Accession]"
                search_handle = Entrez.esearch(db="gds", term=query)
                search_results = Entrez.read(search_handle)
                search_handle.close()

                if search_results['IdList']:
                    # Fetch summary
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]

                # Be polite to NCBI servers
                time.sleep(0.34)  # Max 3 requests per second

            except Exception as e:
                print(f"Error fetching {accession}: {e}")

    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
```
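The `time.sleep(0.34)` above can be factored into a reusable rate limiter so every E-utilities call path respects NCBI's limit of 3 requests per second without an API key. A sketch (the `throttle` decorator is illustrative, not a Biopython feature):

```python
import time
import functools

def throttle(min_interval):
    """Decorator that enforces a minimum interval between calls."""
    def decorator(func):
        last_call = [0.0]

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@throttle(1 / 3)  # at most 3 calls per second
def fetch(i):
    # Placeholder for an Entrez.esearch/esummary call
    return i

start = time.monotonic()
values = [fetch(i) for i in range(4)]
elapsed = time.monotonic() - start
print(values)          # [0, 1, 2, 3]
print(elapsed >= 0.9)  # three enforced gaps of ~0.33 s each
```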

5. Direct FTP Access for Data Files


FTP URLs for GEO Data:

GEO data can be downloaded directly via FTP:

```python
import ftplib
import os

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO files via FTP"""
    # Construct FTP path based on accession type
    if accession.startswith("GSE"):
        # Series live in range directories where the last three digits
        # are replaced with "nnn", e.g. GSE123456 -> GSE123nnn
        gse_num = accession[3:]
        base_num = gse_num[:-3] + "nnn"
        ftp_dir = f"/geo/series/GSE{base_num}/{accession}"

        if file_type == "matrix":
            ftp_path = f"{ftp_dir}/matrix/"
            filename = f"{accession}_series_matrix.txt.gz"
        elif file_type == "soft":
            ftp_path = f"{ftp_dir}/soft/"
            filename = f"{accession}_family.soft.gz"
        elif file_type == "miniml":
            ftp_path = f"{ftp_dir}/miniml/"
            filename = f"{accession}_family.xml.tgz"
        else:
            raise ValueError(f"Unknown file_type: {file_type}")
    else:
        raise ValueError("Only GSE accessions are handled here")

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)

    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)

    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")
```
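The range-directory rule used in `download_geo_ftp` (last three digits of the numeric part replaced with `nnn`) is worth isolating as a pure helper so it can be tested without touching the network; `geo_series_dir` is a name chosen here for illustration:

```python
def geo_series_dir(accession):
    """Return the FTP directory for a GSE accession.

    GEO groups series into range directories where the last three
    digits of the numeric part become 'nnn':
    GSE123456 -> /geo/series/GSE123nnn/GSE123456/
    GSE570    -> /geo/series/GSEnnn/GSE570/
    """
    num = accession[3:]
    return f"/geo/series/GSE{num[:-3]}nnn/{accession}/"

print(geo_series_dir("GSE123456"))  # /geo/series/GSE123nnn/GSE123456/
```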

**Using wget or curl for Downloads:**

```bash
# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz

# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/

# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
```

6. Analyzing GEO Data


Quality Control and Preprocessing:

```python
import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log transformation (if needed)
if expression_df.min().min() > 0:
    # Heuristic: values above ~100 suggest the data is not yet log-transformed
    if expression_df.max().max() > 100:
        expression_df = np.log2(expression_df + 1)
        print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")

plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')
```
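The log-transformation heuristic above can be exercised on synthetic data to see when it fires; `maybe_log2` is a hypothetical wrapper around the same two checks:

```python
import numpy as np
import pandas as pd

def maybe_log2(df):
    """Apply log2(x+1) only when values look like linear-scale intensities."""
    if df.min().min() > 0 and df.max().max() > 100:
        return np.log2(df + 1), True
    return df, False

# Synthetic linear-scale intensities: the heuristic should fire
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.uniform(50, 5000, size=(100, 4)),
                   columns=[f"GSM{i}" for i in range(1, 5)])

logged, applied = maybe_log2(raw)
print(applied)                        # True
print(bool(logged.max().max() < 13))  # log2(5001) is about 12.3

# Already-logged data passes through unchanged
again, applied2 = maybe_log2(logged)
print(applied2)                       # False
```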

**Differential Expression Analysis:**

```python
import GEOparse
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]

    # Difference of means on log2-scale data is a log2 fold change
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)

    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

# Create results DataFrame
de_results = pd.DataFrame(results)

# Multiple testing correction (Benjamini-Hochberg)
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'], method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (abs(de_results['log2_fold_change']) > 1)
]
print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)
```

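The per-gene loop above is clear but slow on large matrices; `scipy.stats.ttest_ind` accepts 2-D input and tests every row in one vectorized call. A sketch on synthetic log2-scale data (not a real GEO series):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
genes = [f"gene{i}" for i in range(500)]

# Synthetic log2 expression: 3 control and 3 treatment samples
control = pd.DataFrame(rng.normal(8, 1, (500, 3)), index=genes)
treatment = pd.DataFrame(rng.normal(8, 1, (500, 3)), index=genes)
treatment.iloc[:50] += 2  # spike in true differential signal

# axis=1 tests each gene (row) against its counterpart row at once
t_stat, p_value = stats.ttest_ind(treatment, control, axis=1)
log2_fc = treatment.mean(axis=1) - control.mean(axis=1)

de = pd.DataFrame({"log2_fold_change": log2_fc, "p_value": p_value})
print(de.shape)  # (500, 2)
```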
**Correlation and Clustering Analysis:**

```python
import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Sample correlation heatmap
sample_corr = expression_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0, square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')

# Hierarchical clustering
distances = pdist(expression_df.T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')

plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage, labels=list(expression_df.columns))
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')
```
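A PCA of samples is a common complement to the correlation heatmap and dendrogram for spotting outliers. A self-contained sketch via SVD on synthetic data (not tied to the GEOparse object above):

```python
import numpy as np

rng = np.random.default_rng(2)
expr = rng.normal(8, 1, size=(1000, 6))  # genes x samples, log2 scale

# Samples as rows; center each gene across samples, then decompose
X = expr.T - expr.T.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * S                      # sample coordinates on principal components
explained = S**2 / (S**2).sum()  # variance fraction per component

print(pcs.shape)                         # (6, 6)
print(round(float(explained.sum()), 6))  # 1.0
```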

7. Batch Processing Multiple Datasets


Download and Process Multiple Series:

```python
import GEOparse
import pandas as pd
import os

def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series"""
    os.makedirs(destdir, exist_ok=True)
    results = {}

    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)

            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
                'organism': gse.metadata.get('organism', ['N/A'])[0],
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }

            # Save expression data
            if hasattr(gse, 'pivot_samples'):
                expr_df = gse.pivot_samples('VALUE')
                expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv")
                results[gse_id]['num_genes'] = len(expr_df)

        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}

    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(f"{destdir}/batch_summary.csv")

    return results

# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)
```

**Meta-Analysis Across Studies:**

```python
import GEOparse
import pandas as pd
import numpy as np

def meta_analysis_geo(gse_list, gene_of_interest):
    """Perform meta-analysis of gene expression across studies"""
    results = []

    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")

            # Get platform annotation
            gpl = list(gse.gpls.values())[0]

            # Find gene in platform
            if hasattr(gpl, 'table'):
                gene_probes = gpl.table[
                    gpl.table['Gene Symbol'].str.contains(
                        gene_of_interest,
                        case=False,
                        na=False
                    )
                ]

                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')

                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]

                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })

        except Exception as e:
            print(f"Error in {gse_id}: {e}")

    return pd.DataFrame(results)
gse_list = ["GSE100001", "GSE100002", "GSE100003"] results = batch_download_geo(gse_list)


Meta-analysis for TP53:

gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
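The per-study summaries produced by meta_analysis_geo can then be pooled into a single estimate. A minimal sketch of a fixed-effect (inverse-variance weighted) combination, assuming the column names produced above; the demo numbers are made up:

```python
import numpy as np
import pandas as pd

def pool_fixed_effect(meta_df):
    """Combine per-study mean expression values with inverse-variance weights.

    Expects the columns produced by meta_analysis_geo:
    'mean_expression', 'std_expression', 'num_samples'.
    """
    # Variance of each study mean = sd^2 / n
    var = meta_df["std_expression"] ** 2 / meta_df["num_samples"]
    weights = 1.0 / var
    pooled_mean = np.sum(weights * meta_df["mean_expression"]) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return pooled_mean, pooled_se

# Toy example with made-up per-study summaries
demo = pd.DataFrame({
    "mean_expression": [8.2, 7.9, 8.5],
    "std_expression": [1.0, 1.2, 0.8],
    "num_samples": [20, 35, 12],
})
pooled, se = pool_fixed_effect(demo)
```

Studies with more samples and tighter variance receive proportionally more weight; for heterogeneous studies a random-effects model would be more appropriate.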

Installation and Setup

Python Libraries

```bash
# Primary GEO access library (recommended)
uv pip install GEOparse

# For E-utilities and programmatic NCBI access
uv pip install biopython

# For data analysis
uv pip install pandas numpy scipy

# For visualization
uv pip install matplotlib seaborn

# For statistical analysis
uv pip install statsmodels scikit-learn
```

Configuration

Set up NCBI E-utilities access:

```python
from Bio import Entrez

# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"

# Optional: Set API key for increased rate limits
# (10 requests/second with a key, 3 without)
Entrez.api_key = "your_api_key_here"
```
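With Entrez configured, GEO metadata can be searched programmatically through the E-utilities. A minimal sketch, assuming network access to NCBI; the query term is illustrative and uses GEO DataSets (db="gds") field tags such as [ETYP] for entry type and [ORGN] for organism:

```python
from Bio import Entrez

# Required by NCBI for all E-utilities requests
Entrez.email = "your.email@example.com"

# Search GEO DataSets for human breast cancer series (gse entries)
handle = Entrez.esearch(
    db="gds",
    term='breast cancer AND "gse"[ETYP] AND "Homo sapiens"[ORGN]',
    retmax=20,
)
record = Entrez.read(handle)
handle.close()

print(f"{record['Count']} matching entries")
print(record["IdList"][:5])
```

The returned UIDs can be passed to Entrez.esummary for accession numbers and titles.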

Common Use Cases


Transcriptomics Research


  • Download gene expression data for specific conditions
  • Compare expression profiles across studies
  • Identify differentially expressed genes
  • Perform meta-analyses across multiple datasets

Drug Response Studies


  • Analyze gene expression changes after drug treatment
  • Identify biomarkers for drug response
  • Compare drug effects across cell lines or patients
  • Build predictive models for drug sensitivity

Disease Biology


  • Study gene expression in disease vs. normal tissues
  • Identify disease-associated expression signatures
  • Compare patient subgroups and disease stages
  • Correlate expression with clinical outcomes

Biomarker Discovery


  • Screen for diagnostic or prognostic markers
  • Validate biomarkers across independent cohorts
  • Compare marker performance across platforms
  • Integrate expression with clinical data

Key Concepts


SOFT (Simple Omnibus Format in Text): GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse.
MINiML (MIAME Notation in Markup Language): XML format for GEO data, used for programmatic access and data exchange.
Series Matrix: Tab-delimited expression matrix with samples as columns and genes/probes as rows. Fastest format for getting expression data.
MIAME Compliance: Minimum Information About a Microarray Experiment - standardized annotation that GEO enforces for all submissions.
Expression Value Types: Different types of expression measurements (raw signal, normalized, log-transformed). Always check platform and processing methods.
Platform Annotation: Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.
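A series matrix file can also be parsed directly with pandas, without GEOparse. A minimal sketch, assuming the standard layout: metadata lines prefixed with `!`, and the expression table delimited by the `!series_matrix_table_begin` and `!series_matrix_table_end` markers:

```python
import gzip
import pandas as pd

def read_series_matrix(path):
    """Parse a GEO series matrix file into a probe x sample DataFrame.

    Metadata lines start with '!'; the expression table sits between
    the !series_matrix_table_begin and !series_matrix_table_end markers.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        lines = fh.read().splitlines()

    start = lines.index("!series_matrix_table_begin") + 1
    end = lines.index("!series_matrix_table_end")
    table = [line.split("\t") for line in lines[start:end]]

    df = pd.DataFrame(table[1:], columns=table[0])
    df = df.set_index(table[0][0])
    # Identifiers are quoted strings in the raw file; strip the quotes
    df.columns = [c.strip('"') for c in df.columns]
    df.index = [i.strip('"') for i in df.index]
    return df.astype(float)
```

Note this assumes a single numeric table; some older submissions deviate from the standard layout, in which case GEOparse's parser is more forgiving.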

GEO2R Web Tool


For quick analysis without coding, use GEO2R:
  • Web-based statistical analysis tool integrated into GEO
  • Accessible at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx
  • Performs differential expression analysis
  • Generates R scripts for reproducibility
  • Useful for exploratory analysis before downloading data

Rate Limiting and Best Practices


NCBI E-utilities Rate Limits:
  • Without API key: 3 requests per second
  • With API key: 10 requests per second
  • Implement delays between requests: time.sleep(0.34) (no API key) or time.sleep(0.1) (with API key)
FTP Access:
  • No rate limits for FTP downloads
  • Preferred method for bulk downloads
  • Can download entire directories with wget -r
GEOparse Caching:
  • GEOparse automatically caches downloaded files in destdir
  • Subsequent calls use cached data
  • Clean cache periodically to save disk space
Optimal Practices:
  • Use GEOparse for series-level access (easiest)
  • Use E-utilities for metadata searching and batch queries
  • Use FTP for direct file downloads and bulk operations
  • Cache data locally to avoid repeated downloads
  • Always set Entrez.email when using Biopython
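The delay advice above can be wrapped in a small throttle so every E-utilities call respects the limit automatically. A sketch of one way to do this; fetch_summary is a hypothetical stand-in for a real Entrez call:

```python
import time

def throttled(min_interval):
    """Decorator enforcing a minimum interval between calls,
    e.g. 0.34 s without an NCBI API key, 0.1 s with one."""
    def decorate(func):
        last_call = [0.0]  # mutable cell holding the last call time

        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorate

@throttled(0.34)  # no API key: at most ~3 requests/second
def fetch_summary(uid):
    # Placeholder: in practice this would wrap an
    # Entrez.esummary(db="gds", id=uid) request.
    return uid
```

Calling fetch_summary in a loop then never exceeds the rate limit, with no sleep bookkeeping at the call sites.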

Resources


references/geo_reference.md


Comprehensive reference documentation covering:
  • Detailed E-utilities API specifications and endpoints
  • Complete SOFT and MINiML file format documentation
  • Advanced GEOparse usage patterns and examples
  • FTP directory structure and file naming conventions
  • Data processing pipelines and normalization methods
  • Troubleshooting common issues and error handling
  • Platform-specific considerations and quirks
Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.

Important Notes


Data Quality Considerations


  • GEO accepts user-submitted data with varying quality standards
  • Always check platform annotation and processing methods
  • Verify sample metadata and experimental design
  • Be cautious with batch effects across studies
  • Consider reprocessing raw data for consistency
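As a crude guard against study-level shifts before cross-study comparison, expression values can be standardized within each study. This is only a sketch and not a substitute for dedicated batch-correction methods such as ComBat; the sample and study names are illustrative:

```python
import pandas as pd

def zscore_within_study(expr, study_labels):
    """Gene-wise z-score each sample within its own study.

    expr: genes x samples DataFrame.
    study_labels: dict mapping sample name -> study id.
    Crude mitigation for study-level shifts; for real work consider
    dedicated batch-correction methods (e.g. ComBat).
    """
    pieces = []
    for study in set(study_labels.values()):
        cols = [s for s in expr.columns if study_labels[s] == study]
        block = expr[cols]
        # Center and scale each gene using that study's samples only
        pieces.append(block.sub(block.mean(axis=1), axis=0)
                           .div(block.std(axis=1), axis=0))
    # Reassemble in the original column order
    return pd.concat(pieces, axis=1)[expr.columns]
```

After this transform, differences in per-study baseline and scale are removed, at the cost of erasing genuine between-study level differences.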

File Size Warnings


  • Series matrix files can be large (>1 GB for large studies)
  • Supplementary files (e.g., CEL files) can be very large
  • Plan for adequate disk space before downloading
  • Consider downloading samples incrementally

Data Usage and Citation


  • GEO data is freely available for research use
  • Always cite original studies when using GEO data
  • Cite GEO database: Barrett et al. (2013) Nucleic Acids Research
  • Check individual dataset usage restrictions (if any)
  • Follow NCBI guidelines for programmatic access

Common Pitfalls


  • Different platforms use different probe IDs (requires annotation mapping)
  • Expression values may be raw, normalized, or log-transformed (check metadata)
  • Sample metadata can be inconsistently formatted across studies
  • Not all series have series matrix files (older submissions)
  • Platform annotations may be outdated (genes renamed, IDs deprecated)
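The first two pitfalls (probe IDs and value types) typically require collapsing probe-level values to gene symbols before cross-platform comparison. A sketch assuming an Affymetrix-style platform table with 'ID' and 'Gene Symbol' columns; column names vary by platform, so inspect gpl.table first:

```python
import pandas as pd

def probes_to_genes(expr_df, platform_table, agg="mean"):
    """Collapse a probe x sample matrix to gene symbols.

    expr_df: probes as index, samples as columns.
    platform_table: must contain 'ID' and 'Gene Symbol' columns
    (names vary by platform; always inspect gpl.table first).
    Affymetrix-style entries like 'TP53 /// WRAP53' keep the first symbol.
    """
    mapping = (
        platform_table.set_index("ID")["Gene Symbol"]
        .dropna()
        .str.split(" /// ").str[0]
    )
    # Attach gene symbols by probe ID, dropping unannotated probes
    annotated = expr_df.join(mapping.rename("gene"), how="inner")
    return annotated.groupby("gene").agg(agg)

# Toy example: two probes mapping to the same gene are averaged
expr = pd.DataFrame(
    {"GSM1": [1.0, 3.0, 5.0], "GSM2": [2.0, 4.0, 6.0]},
    index=["p1", "p2", "p3"],
)
gpl = pd.DataFrame({
    "ID": ["p1", "p2", "p3"],
    "Gene Symbol": ["TP53", "TP53 /// WRAP53", "GAPDH"],
})
gene_expr = probes_to_genes(expr, gpl)
```

Taking the mean across probes is one common choice; taking the probe with maximal average intensity or variance is another, and the choice should be stated in any analysis.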

Additional Resources
