GEO Database
Overview

The Gene Expression Omnibus (GEO) is NCBI's public repository for high-throughput gene expression and functional genomics data. GEO contains over 264,000 studies with more than 8 million samples from both array-based and sequence-based experiments.

When to Use This Skill

This skill should be used when searching for gene expression datasets, retrieving experimental data, downloading raw and processed files, querying expression profiles, or integrating GEO data into computational analysis workflows.

Core Capabilities

1. Understanding GEO Data Organization
GEO organizes data hierarchically using different accession types:

**Series (GSE):** A complete experiment with a set of related samples
- Example: GSE123456
- Contains experimental design, samples, and overall study information
- Largest organizational unit in GEO
- Current count: 264,928+ series

**Sample (GSM):** A single experimental sample or biological replicate
- Example: GSM987654
- Contains individual sample data, protocols, and metadata
- Linked to platforms and series
- Current count: 8,068,632+ samples

**Platform (GPL):** The microarray or sequencing platform used
- Example: GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array)
- Describes the technology and probe/feature annotations
- Shared across multiple experiments
- Current count: 27,739+ platforms

**DataSet (GDS):** Curated collections with consistent formatting
- Example: GDS5678
- Experimentally comparable samples organized by study design
- Processed for differential analysis
- Subset of GEO data (4,348 curated datasets)
- Ideal for quick comparative analyses

**Profiles:** Gene-specific expression data linked to sequence features
- Queryable by gene name or annotation
- Cross-references to Entrez Gene
- Enables gene-centric searches across all studies
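Each accession type can be recognized from its prefix, which is handy for dispatching download or parsing logic. A minimal sketch (the helper name and mapping below are illustrative, not part of any GEO client library):

```python
import re

# Map GEO accession prefixes to the record types described above
GEO_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

def classify_geo_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = re.fullmatch(r"(GSE|GSM|GPL|GDS)(\d+)", accession.strip().upper())
    if not match:
        return None
    return GEO_TYPES[match.group(1)]

print(classify_geo_accession("GSE123456"))  # Series
print(classify_geo_accession("gpl570"))     # Platform
```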
2. Searching GEO Data
**GEO DataSets Search:**

Search for studies by keywords, organism, or experimental conditions:

```python
from Bio import Entrez

# Configure Entrez (required)
Entrez.email = "your.email@example.com"

# Search for datasets
def search_geo_datasets(query, retmax=20):
    """Search GEO DataSets database"""
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Example searches
results = search_geo_datasets("breast cancer[MeSH] AND Homo sapiens[Organism]")
print(f"Found {results['Count']} datasets")

# Search by specific platform
results = search_geo_datasets("GPL570[Accession]")

# Search by study type
results = search_geo_datasets("expression profiling by array[DataSet Type]")
```
**GEO Profiles Search:**

Find gene-specific expression patterns:

```python
# Search for gene expression profiles
def search_geo_profiles(gene_name, organism="Homo sapiens", retmax=100):
    """Search GEO Profiles for a specific gene"""
    query = f"{gene_name}[Gene Name] AND {organism}[Organism]"
    handle = Entrez.esearch(
        db="geoprofiles",
        term=query,
        retmax=retmax
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Find TP53 expression across studies
tp53_results = search_geo_profiles("TP53", organism="Homo sapiens")
print(f"Found {tp53_results['Count']} expression profiles for TP53")
```

**Advanced Search Patterns:**

```python
# Combine multiple search terms
def advanced_geo_search(terms, operator="AND"):
    """Build complex search queries"""
    query = f" {operator} ".join(terms)
    return search_geo_datasets(query)

# Find recent high-throughput studies
search_terms = [
    "RNA-seq[DataSet Type]",
    "Homo sapiens[Organism]",
    "2024[Publication Date]"
]
results = advanced_geo_search(search_terms)

# Search by author and condition
search_terms = [
    "Smith[Author]",
    "diabetes[Disease]"
]
results = advanced_geo_search(search_terms)
```
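The bracketed field tags can also be generated from plain (value, field) pairs, which avoids hand-writing brackets in longer queries. A small sketch (`build_geo_query` is an illustrative helper, not part of Biopython or GEOparse):

```python
def build_geo_query(pairs, operator="AND"):
    """Build an Entrez query string from (value, field) pairs.

    Each pair becomes 'value[Field]'; a pair with field=None is kept
    as a free-text term.
    """
    terms = [f"{value}[{field}]" if field else str(value) for value, field in pairs]
    return f" {operator} ".join(terms)

query = build_geo_query([
    ("RNA-seq", "DataSet Type"),
    ("Homo sapiens", "Organism"),
    ("2024", "Publication Date"),
])
print(query)
# RNA-seq[DataSet Type] AND Homo sapiens[Organism] AND 2024[Publication Date]
```

The resulting string can be passed straight to `search_geo_datasets` or `Entrez.esearch`.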
3. Retrieving GEO Data with GEOparse (Recommended)
GEOparse is the primary Python library for accessing GEO data:

**Installation:**

```bash
uv pip install GEOparse
```

**Basic Usage:**

```python
import GEOparse

# Download and parse a GEO Series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Access series metadata
print(gse.metadata['title'])
print(gse.metadata['summary'])
print(gse.metadata['overall_design'])

# Access sample information
for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")
    print(f"  Characteristics: {gsm.metadata.get('characteristics_ch1', [])}")

# Access platform information
for gpl_name, gpl in gse.gpls.items():
    print(f"Platform: {gpl_name}")
    print(f"  Title: {gpl.metadata['title'][0]}")
    print(f"  Organism: {gpl.metadata['organism'][0]}")
```
**Working with Expression Data:**

```python
import GEOparse
import pandas as pd

# Get expression data from series
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Extract expression matrix
# Method 1: From series matrix file (fastest)
if hasattr(gse, 'pivot_samples'):
    expression_df = gse.pivot_samples('VALUE')
    print(expression_df.shape)  # genes x samples

# Method 2: From individual samples
expression_data = {}
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'table'):
        expression_data[gsm_name] = gsm.table['VALUE']
expression_df = pd.DataFrame(expression_data)
print(f"Expression matrix: {expression_df.shape}")
```
**Accessing Supplementary Files:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Download supplementary files
gse.download_supplementary_files(
    directory="./data/GSE123456_suppl",
    download_sra=False  # Set to True to download SRA files
)

# List available supplementary files
for gsm_name, gsm in gse.gsms.items():
    if hasattr(gsm, 'supplementary_files'):
        print(f"Sample {gsm_name}:")
        for file_url in gsm.metadata.get('supplementary_file', []):
            print(f"  {file_url}")
```
**Filtering and Subsetting Data:**

```python
import GEOparse

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")

# Filter samples by metadata
control_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'control' in gsm.metadata.get('title', [''])[0].lower()
]
treatment_samples = [
    gsm_name for gsm_name, gsm in gse.gsms.items()
    if 'treatment' in gsm.metadata.get('title', [''])[0].lower()
]
print(f"Control samples: {len(control_samples)}")
print(f"Treatment samples: {len(treatment_samples)}")

# Extract subset expression matrix
expression_df = gse.pivot_samples('VALUE')
control_expr = expression_df[control_samples]
treatment_expr = expression_df[treatment_samples]
```
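Probe-level rows in the expression matrix can be mapped to gene symbols by joining against the platform table (`gpl.table` in GEOparse). The annotation column name varies by platform ('Gene Symbol', 'gene_assignment', etc.), so inspect `gpl.table.columns` first. A sketch with the join logic shown on toy data in place of a real download:

```python
import pandas as pd

def annotate_probes(expression_df, platform_table, symbol_col="Gene Symbol"):
    """Attach gene symbols to a probe-indexed expression matrix.

    platform_table is expected to have an 'ID' column matching the
    expression index; symbol_col depends on the platform.
    """
    annotation = platform_table.set_index("ID")[symbol_col]
    return expression_df.join(annotation, how="left")

# Toy stand-ins for gse.pivot_samples('VALUE') and gpl.table
expr = pd.DataFrame(
    {"GSM1": [5.2, 7.1], "GSM2": [5.0, 6.8]},
    index=pd.Index(["1007_s_at", "1053_at"], name="ID_REF"),
)
gpl_table = pd.DataFrame({
    "ID": ["1007_s_at", "1053_at"],
    "Gene Symbol": ["DDR1", "RFC2"],
})
print(annotate_probes(expr, gpl_table))
```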
4. Using NCBI E-utilities for GEO Access
E-utilities provide lower-level programmatic access to GEO metadata:
**Basic E-utilities Workflow:**

```python
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

# Step 1: Search for GEO entries
def search_geo(query, db="gds", retmax=100):
    """Search GEO using E-utilities"""
    handle = Entrez.esearch(
        db=db,
        term=query,
        retmax=retmax,
        usehistory="y"
    )
    results = Entrez.read(handle)
    handle.close()
    return results

# Step 2: Fetch summaries
def fetch_geo_summaries(id_list, db="gds"):
    """Fetch document summaries for GEO entries"""
    ids = ",".join(id_list)
    handle = Entrez.esummary(db=db, id=ids)
    summaries = Entrez.read(handle)
    handle.close()
    return summaries

# Step 3: Fetch full records
def fetch_geo_records(id_list, db="gds"):
    """Fetch full GEO records"""
    ids = ",".join(id_list)
    handle = Entrez.efetch(db=db, id=ids, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return records

# Example workflow
search_results = search_geo("breast cancer AND Homo sapiens")
id_list = search_results['IdList'][:5]
summaries = fetch_geo_summaries(id_list)
for summary in summaries:
    print(f"GDS: {summary.get('Accession', 'N/A')}")
    print(f"Title: {summary.get('title', 'N/A')}")
    print(f"Samples: {summary.get('n_samples', 'N/A')}")
    print()
```
**Batch Processing with E-utilities:**

```python
from Bio import Entrez
import time

Entrez.email = "your.email@example.com"

def batch_fetch_geo_metadata(accessions, batch_size=100):
    """Fetch metadata for multiple GEO accessions"""
    results = {}
    for i in range(0, len(accessions), batch_size):
        batch = accessions[i:i + batch_size]
        # Search for each accession
        for accession in batch:
            try:
                query = f"{accession}[Accession]"
                search_handle = Entrez.esearch(db="gds", term=query)
                search_results = Entrez.read(search_handle)
                search_handle.close()
                if search_results['IdList']:
                    # Fetch summary
                    summary_handle = Entrez.esummary(
                        db="gds",
                        id=search_results['IdList'][0]
                    )
                    summary = Entrez.read(summary_handle)
                    summary_handle.close()
                    results[accession] = summary[0]
                # Be polite to NCBI servers
                time.sleep(0.34)  # Max 3 requests per second
            except Exception as e:
                print(f"Error fetching {accession}: {e}")
    return results

# Fetch metadata for multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
metadata = batch_fetch_geo_metadata(gse_list)
```
5. Direct FTP Access for Data Files
**FTP URLs for GEO Data:**

GEO data can be downloaded directly via FTP:

```python
import ftplib
import os

def download_geo_ftp(accession, file_type="matrix", dest_dir="./data"):
    """Download GEO files via FTP"""
    # Construct FTP path based on accession type
    if accession.startswith("GSE"):
        # Series files live under GSE###nnn/GSE######/<file_type>/
        gse_num = accession[3:]
        base_num = gse_num[:-3] + "nnn"
        ftp_path = f"/geo/series/GSE{base_num}/{accession}/"
        if file_type == "matrix":
            ftp_path += "matrix/"
            filename = f"{accession}_series_matrix.txt.gz"
        elif file_type == "soft":
            ftp_path += "soft/"
            filename = f"{accession}_family.soft.gz"
        elif file_type == "miniml":
            ftp_path += "miniml/"
            filename = f"{accession}_family.xml.tgz"

    # Connect to FTP server
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd(ftp_path)

    # Download file
    os.makedirs(dest_dir, exist_ok=True)
    local_file = os.path.join(dest_dir, filename)
    with open(local_file, 'wb') as f:
        ftp.retrbinary(f'RETR {filename}', f.write)
    ftp.quit()
    print(f"Downloaded: {local_file}")
    return local_file

# Download series matrix file
download_geo_ftp("GSE123456", file_type="matrix")

# Download SOFT format file
download_geo_ftp("GSE123456", file_type="soft")
```
**Using wget or curl for Downloads:**

```bash
# Download series matrix file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz

# Download all supplementary files for a series
wget -r -np -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/suppl/

# Download SOFT format family file
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/soft/GSE123456_family.soft.gz
```
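The path convention in these URLs (the accession's last three digits replaced by `nnn`) can be factored into a small pure helper, so the same logic serves both Python and shell downloads. A sketch (`geo_series_url` is an illustrative name; the layout follows the wget URLs above):

```python
def geo_series_url(accession, file_type="matrix"):
    """Build the FTP URL for a GEO series file.

    Follows the /geo/series/GSE###nnn/GSE######/ directory layout.
    """
    stem = accession[:-3] + "nnn"  # GSE123456 -> GSE123nnn
    base = f"ftp://ftp.ncbi.nlm.nih.gov/geo/series/{stem}/{accession}"
    if file_type == "matrix":
        return f"{base}/matrix/{accession}_series_matrix.txt.gz"
    if file_type == "soft":
        return f"{base}/soft/{accession}_family.soft.gz"
    if file_type == "suppl":
        return f"{base}/suppl/"
    raise ValueError(f"unknown file_type: {file_type}")

print(geo_series_url("GSE123456"))
# ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE123nnn/GSE123456/matrix/GSE123456_series_matrix.txt.gz
```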
6. Analyzing GEO Data
**Quality Control and Preprocessing:**

```python
import GEOparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Check for missing values
print(f"Missing values: {expression_df.isnull().sum().sum()}")

# Log transformation (if needed)
if expression_df.min().min() > 0:  # Check if already log-transformed
    if expression_df.max().max() > 100:
        expression_df = np.log2(expression_df + 1)
        print("Applied log2 transformation")

# Distribution plots
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
expression_df.plot.box(ax=plt.gca())
plt.title("Expression Distribution per Sample")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
expression_df.mean(axis=1).hist(bins=50)
plt.title("Gene Expression Distribution")
plt.xlabel("Average Expression")
plt.tight_layout()
plt.savefig("geo_qc.png", dpi=300, bbox_inches='tight')
```
**Differential Expression Analysis:**

```python
import GEOparse
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Define sample groups
control_samples = ["GSM1", "GSM2", "GSM3"]
treatment_samples = ["GSM4", "GSM5", "GSM6"]

# Calculate fold changes and p-values
# (the mean difference is a log2 fold change only for log-scale values)
results = []
for gene in expression_df.index:
    control_expr = expression_df.loc[gene, control_samples]
    treatment_expr = expression_df.loc[gene, treatment_samples]
    # Calculate statistics
    fold_change = treatment_expr.mean() - control_expr.mean()
    t_stat, p_value = stats.ttest_ind(treatment_expr, control_expr)
    results.append({
        'gene': gene,
        'log2_fold_change': fold_change,
        'p_value': p_value,
        'control_mean': control_expr.mean(),
        'treatment_mean': treatment_expr.mean()
    })

# Create results DataFrame
de_results = pd.DataFrame(results)

# Multiple testing correction (Benjamini-Hochberg)
_, de_results['q_value'], _, _ = multipletests(
    de_results['p_value'],
    method='fdr_bh'
)

# Filter significant genes
significant_genes = de_results[
    (de_results['q_value'] < 0.05) &
    (abs(de_results['log2_fold_change']) > 1)
]
print(f"Significant genes: {len(significant_genes)}")
significant_genes.to_csv("de_results.csv", index=False)
```
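For large matrices a per-gene Python loop is slow; `scipy.stats.ttest_ind` accepts 2-D arrays and tests every row at once. A sketch on synthetic data standing in for real expression values (assumes log-scale values, as in the loop-based version):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic log2 expression: 100 genes, 3 control + 3 treatment samples
control = rng.normal(8.0, 1.0, size=(100, 3))
treatment = rng.normal(8.0, 1.0, size=(100, 3))
treatment[:10] += 2.0  # spike in 10 truly up-regulated genes

# One t-test per row, vectorized across all genes
t_stat, p_value = stats.ttest_ind(treatment, control, axis=1)
log2_fc = treatment.mean(axis=1) - control.mean(axis=1)

de = pd.DataFrame({"log2_fold_change": log2_fc, "p_value": p_value})
print((de["p_value"] < 0.05).sum(), "nominally significant genes")
```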
**Correlation and Clustering Analysis:**

```python
import GEOparse
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

gse = GEOparse.get_GEO(geo="GSE123456", destdir="./data")
expression_df = gse.pivot_samples('VALUE')

# Sample correlation heatmap
sample_corr = expression_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(sample_corr, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title("Sample Correlation Matrix")
plt.tight_layout()
plt.savefig("sample_correlation.png", dpi=300, bbox_inches='tight')

# Hierarchical clustering
distances = pdist(expression_df.T, metric='correlation')
linkage = hierarchy.linkage(distances, method='average')
plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage, labels=expression_df.columns)
plt.title("Hierarchical Clustering of Samples")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.xticks(rotation=90)
plt.tight_layout()
plt.savefig("sample_clustering.png", dpi=300, bbox_inches='tight')
```
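A PCA of samples is a common complement to correlation heatmaps and dendrograms for spotting batch effects or group structure. A sketch using scikit-learn on a synthetic genes-by-samples matrix in place of a real `expression_df`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic genes x samples matrix standing in for expression_df.values
n_genes, n_samples = 500, 6
X = rng.normal(8.0, 1.0, size=(n_genes, n_samples))
X[:, 3:] += rng.normal(1.0, 0.2, size=(n_genes, 1))  # shift samples 4-6

# PCA expects samples as rows, so transpose the genes x samples matrix
pca = PCA(n_components=2)
coords = pca.fit_transform(X.T)

print("Coordinates shape:", coords.shape)
print("Explained variance:", pca.explained_variance_ratio_)
```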
7. Batch Processing Multiple Datasets
**Download and Process Multiple Series:**

```python
import GEOparse
import pandas as pd
import os

def batch_download_geo(gse_list, destdir="./geo_data"):
    """Download multiple GEO series"""
    results = {}
    for gse_id in gse_list:
        try:
            print(f"Processing {gse_id}...")
            gse = GEOparse.get_GEO(geo=gse_id, destdir=destdir)
            # Extract key information
            results[gse_id] = {
                'title': gse.metadata.get('title', ['N/A'])[0],
                'organism': gse.metadata.get('organism', ['N/A'])[0],
                'platform': list(gse.gpls.keys())[0] if gse.gpls else 'N/A',
                'num_samples': len(gse.gsms),
                'submission_date': gse.metadata.get('submission_date', ['N/A'])[0]
            }
            # Save expression data
            if hasattr(gse, 'pivot_samples'):
                expr_df = gse.pivot_samples('VALUE')
                expr_df.to_csv(f"{destdir}/{gse_id}_expression.csv")
                results[gse_id]['num_genes'] = len(expr_df)
        except Exception as e:
            print(f"Error processing {gse_id}: {e}")
            results[gse_id] = {'error': str(e)}
    # Save summary
    summary_df = pd.DataFrame(results).T
    summary_df.to_csv(f"{destdir}/batch_summary.csv")
    return results

# Process multiple datasets
gse_list = ["GSE100001", "GSE100002", "GSE100003"]
results = batch_download_geo(gse_list)
```
**Meta-Analysis Across Studies:**

```python
import GEOparse
import pandas as pd
import numpy as np

def meta_analysis_geo(gse_list, gene_of_interest):
    """Perform meta-analysis of gene expression across studies"""
    results = []
    for gse_id in gse_list:
        try:
            gse = GEOparse.get_GEO(geo=gse_id, destdir="./data")
            # Get platform annotation
            gpl = list(gse.gpls.values())[0]
            # Find gene in platform
            if hasattr(gpl, 'table'):
                gene_probes = gpl.table[
                    gpl.table['Gene Symbol'].str.contains(
                        gene_of_interest,
                        case=False,
                        na=False
                    )
                ]
                if not gene_probes.empty:
                    expr_df = gse.pivot_samples('VALUE')
                    for probe_id in gene_probes['ID']:
                        if probe_id in expr_df.index:
                            expr_values = expr_df.loc[probe_id]
                            results.append({
                                'study': gse_id,
                                'probe': probe_id,
                                'mean_expression': expr_values.mean(),
                                'std_expression': expr_values.std(),
                                'num_samples': len(expr_values)
                            })
        except Exception as e:
            print(f"Error in {gse_id}: {e}")
    return pd.DataFrame(results)

# Meta-analysis for TP53
gse_studies = ["GSE100001", "GSE100002", "GSE100003"]
meta_results = meta_analysis_geo(gse_studies, "TP53")
print(meta_results)
```
Installation and Setup
Python Libraries

```bash
# Primary GEO access library (recommended)
uv pip install GEOparse

# For E-utilities and programmatic NCBI access
uv pip install biopython

# For data analysis
uv pip install pandas numpy scipy

# For visualization
uv pip install matplotlib seaborn

# For statistical analysis
uv pip install statsmodels scikit-learn
```
Configuration
Set up NCBI E-utilities access:

```python
from Bio import Entrez

# Always set your email (required by NCBI)
Entrez.email = "your.email@example.com"

# Optional: Set API key for increased rate limits
# Get your API key from: https://www.ncbi.nlm.nih.gov/account/
Entrez.api_key = "your_api_key_here"

# With API key: 10 requests/second
# Without API key: 3 requests/second
```
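Those limits translate directly into a minimum interval between requests, which can be wrapped in a small throttle so batch scripts stay compliant whether or not an API key is set. A sketch (`EntrezThrottle` is an illustrative helper, not part of Biopython):

```python
import time

class EntrezThrottle:
    """Sleep just enough between calls to respect NCBI rate limits."""

    def __init__(self, api_key=None):
        # 10 requests/second with an API key, 3/second without
        self.interval = 1.0 / 10 if api_key else 1.0 / 3
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

throttle = EntrezThrottle(api_key=None)
print(f"Minimum interval: {throttle.interval:.3f} s")  # 0.333 s
# Call throttle.wait() before each Entrez request in a loop
```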
Common Use Cases
Transcriptomics Research
转录组学研究
- Download gene expression data for specific conditions
- Compare expression profiles across studies
- Identify differentially expressed genes
- Perform meta-analyses across multiple datasets
- 下载特定条件下的基因表达数据
- 跨研究比较表达谱
- 鉴定差异表达基因
- 跨多个数据集进行荟萃分析
Drug Response Studies
药物反应研究
- Analyze gene expression changes after drug treatment
- Identify biomarkers for drug response
- Compare drug effects across cell lines or patients
- Build predictive models for drug sensitivity
Disease Biology
- Study gene expression in disease vs. normal tissues
- Identify disease-associated expression signatures
- Compare patient subgroups and disease stages
- Correlate expression with clinical outcomes
Biomarker Discovery
- Screen for diagnostic or prognostic markers
- Validate biomarkers across independent cohorts
- Compare marker performance across platforms
- Integrate expression with clinical data
Key Concepts
**SOFT (Simple Omnibus Format in Text)**: GEO's primary text-based format containing metadata and data tables. Easily parsed by GEOparse.
**MINiML (MIAME Notation in Markup Language)**: XML format for GEO data, used for programmatic access and data exchange.
**Series Matrix**: Tab-delimited expression matrix with samples as columns and genes/probes as rows. The fastest format for getting expression data.
**MIAME Compliance**: Minimum Information About a Microarray Experiment - standardized annotation that GEO enforces for all submissions.
**Expression Value Types**: Different types of expression measurements (raw signal, normalized, log-transformed). Always check the platform and processing methods.
**Platform Annotation**: Maps probe/feature IDs to genes. Essential for biological interpretation of expression data.
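The probe-to-gene mapping behind platform annotation can be sketched with pandas. The probe IDs below are GPL570-style, but the tiny tables are made up for illustration; real ones come from the series matrix file and the platform (GPL) annotation table:

```python
import pandas as pd

# Toy series-matrix slice (probes x samples) and GPL annotation table
expr = pd.DataFrame(
    {"GSM1": [5.2, 7.1, 3.3], "GSM2": [5.0, 7.4, 3.1]},
    index=["1007_s_at", "1053_at", "117_at"],
)
annot = pd.DataFrame(
    {"ID": ["1007_s_at", "1053_at", "117_at"],
     "Gene Symbol": ["DDR1", "RFC2", "HSPA6"]},
).set_index("ID")

# Attach gene symbols, then average any probes mapping to the same gene
expr_genes = expr.join(annot).groupby("Gene Symbol").mean()
print(expr_genes)
```

Averaging is one common way to collapse multiple probes per gene; taking the probe with the highest mean intensity is another, depending on the analysis.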
GEO2R Web Tool
For quick analysis without coding, use GEO2R:
- Web-based statistical analysis tool integrated into GEO
- Accessible at: https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSExxxxx
- Performs differential expression analysis
- Generates R scripts for reproducibility
- Useful for exploratory analysis before downloading data
Rate Limiting and Best Practices
NCBI E-utilities Rate Limits:
- Without API key: 3 requests per second
- With API key: 10 requests per second
- Implement delays between requests: `time.sleep(0.34)` (no API key) or `time.sleep(0.1)` (with API key)
FTP Access:
- No rate limits for FTP downloads
- Preferred method for bulk downloads
- Can download entire directories with wget -r
GEOparse Caching:
- GEOparse automatically caches downloaded files in destdir
- Subsequent calls use cached data
- Clean cache periodically to save disk space
Optimal Practices:
- Use GEOparse for series-level access (easiest)
- Use E-utilities for metadata searching and batch queries
- Use FTP for direct file downloads and bulk operations
- Cache data locally to avoid repeated downloads
- Always set Entrez.email when using Biopython
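The delay guidance above can be wrapped in a small helper so every request path honors the same budget. A sketch; the `Throttle` class is illustrative, not an NCBI or Biopython API:

```python
import time

class Throttle:
    """Illustrative helper enforcing the E-utilities budget:
    3 requests/second without an API key, 10 with one."""

    def __init__(self, api_key=None):
        self.delay = 0.1 if api_key else 0.34
        self._last = 0.0

    def wait(self):
        """Sleep just long enough to respect the per-request delay."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()

throttle = Throttle(api_key=None)  # pass your NCBI API key if you have one
for accession in ["GSE100001", "GSE100002"]:
    throttle.wait()
    # place the Entrez.esearch / Entrez.efetch call for `accession` here
```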
Resources
references/geo_reference.md
Comprehensive reference documentation covering:
- Detailed E-utilities API specifications and endpoints
- Complete SOFT and MINiML file format documentation
- Advanced GEOparse usage patterns and examples
- FTP directory structure and file naming conventions
- Data processing pipelines and normalization methods
- Troubleshooting common issues and error handling
- Platform-specific considerations and quirks
Consult this reference for in-depth technical details, complex query patterns, or when working with uncommon data formats.
Important Notes
Data Quality Considerations
- GEO accepts user-submitted data with varying quality standards
- Always check platform annotation and processing methods
- Verify sample metadata and experimental design
- Be cautious with batch effects across studies
- Consider reprocessing raw data for consistency
File Size Warnings
- Series matrix files can be large (>1 GB for large studies)
- Supplementary files (e.g., CEL files) can be very large
- Plan for adequate disk space before downloading
- Consider downloading samples incrementally
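Resolving a series' FTP directory before downloading makes it easy to inspect file sizes first. GEO groups series under a stub that replaces the accession's last three digits with `nnn`; the helper below is an illustrative sketch of that convention:

```python
def geo_series_ftp_dir(accession):
    """Directory for a series on the GEO FTP site. Assumes the layout
    /geo/series/<stub>/<accession>/, where the stub replaces the
    accession's last three digits with 'nnn'."""
    prefix, digits = accession[:3], accession[3:]
    stub = prefix + (digits[:-3] if len(digits) > 3 else "") + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}/"

print(geo_series_ftp_dir("GSE100001"))
# https://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100001/
```

The returned directory typically contains `matrix/`, `soft/`, and `suppl/` subdirectories; listing `suppl/` first shows supplementary file sizes before committing to a download.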
Data Usage and Citation
- GEO data is freely available for research use
- Always cite original studies when using GEO data
- Cite GEO database: Barrett et al. (2013) Nucleic Acids Research
- Check individual dataset usage restrictions (if any)
- Follow NCBI guidelines for programmatic access
Common Pitfalls
- Different platforms use different probe IDs (requires annotation mapping)
- Expression values may be raw, normalized, or log-transformed (check metadata)
- Sample metadata can be inconsistently formatted across studies
- Not all series have series matrix files (older submissions)
- Platform annotations may be outdated (genes renamed, IDs deprecated)
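The "raw vs. log-transformed" pitfall can be guarded against with a quantile heuristic similar to GEO2R's auto-detect. This re-implementation is a sketch, with thresholds borrowed from GEO2R's generated R scripts rather than a formal standard:

```python
import numpy as np

def needs_log2(values):
    """Quantile heuristic for whether a matrix still holds linear-scale
    intensities (thresholds modeled on GEO2R's auto-detect)."""
    q = np.nanpercentile(values, [0, 25, 50, 75, 99, 100])
    return bool(q[4] > 100 or (q[5] - q[0] > 50 and q[1] > 0))

raw = np.array([[12000.0, 300.0], [55.0, 8000.0]])
expr = np.log2(raw + 1) if needs_log2(raw) else raw
```

Adding 1 before the log avoids `-inf` for zero intensities; already-logged matrices (values typically in the 0-16 range) pass through unchanged.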
Additional Resources
- GEO Website: https://www.ncbi.nlm.nih.gov/geo/
- GEO Submission Guidelines: https://www.ncbi.nlm.nih.gov/geo/info/submission.html
- GEOparse Documentation: https://geoparse.readthedocs.io/
- E-utilities Documentation: https://www.ncbi.nlm.nih.gov/books/NBK25501/
- GEO FTP Site: ftp://ftp.ncbi.nlm.nih.gov/geo/
- GEO2R Tool: https://www.ncbi.nlm.nih.gov/geo/geo2r/
- NCBI API Keys: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
- Biopython Tutorial: https://biopython.org/DIST/docs/tutorial/Tutorial.html