tooluniverse-expression-data-retrieval

Gene Expression & Omics Data Retrieval

Retrieve gene expression experiments and multi-omics datasets with proper disambiguation and quality assessment.
IMPORTANT: Always use English terms in tool calls (gene names, tissue names, condition descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.

Workflow Overview

Phase 0: Clarify Query (if ambiguous)
Phase 1: Disambiguate Gene/Condition
Phase 2: Search & Retrieve (Internal)
Phase 3: Report Dataset Profile

Phase 0: Clarification (When Needed)

Ask the user ONLY if:
  • Gene name is ambiguous (e.g., "p53" → TP53 or MDM2 studies?)
  • Tissue/condition unclear for comparative studies
  • Organism not specified for non-human research
Skip clarification for:
  • Specific accession numbers (`E-MTAB-*`, `E-GEOD-*`, `S-BSST*`)
  • Clear disease/tissue + organism combinations
  • Explicit platform requests (RNA-seq, microarray)
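
The skip rule for accession numbers can be sketched as a small check. This is a minimal illustration, not part of the skill itself: the regular expressions are assumptions derived from the prefixes listed above, and `find_accessions` and `needs_clarification` are hypothetical helper names. The sketch covers only the gene and organism conditions; judging whether a tissue or condition is unclear is left to the caller.

```python
import re

# Accession shapes assumed from the prefixes listed above (E-MTAB-, E-GEOD-, S-BSST);
# adjust the patterns if the databases use other forms.
ACCESSION_PATTERNS = {
    "arrayexpress": re.compile(r"\bE-[A-Z]{4}-\d+\b"),
    "biostudies": re.compile(r"\bS-BSST\d+\b"),
}


def find_accessions(query: str) -> list:
    """Return (database, accession) pairs found in a free-text query."""
    hits = []
    upper_query = query.upper()  # accessions are matched case-insensitively
    for database, pattern in ACCESSION_PATTERNS.items():
        for match in pattern.finditer(upper_query):
            hits.append((database, match.group(0)))
    return hits


def needs_clarification(query: str,
                        gene_is_ambiguous: bool = False,
                        organism_known: bool = True) -> bool:
    """Skip clarification when an accession is present; otherwise ask only
    for an ambiguous gene name or a missing organism."""
    if find_accessions(query):
        return False
    return gene_is_ambiguous or not organism_known
```

A query such as "Get details for E-MTAB-5214" would go straight to retrieval, while an ambiguous gene name with no accession would trigger a question to the user.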

Phase 1: Query Disambiguation

1.1 Gene Name Resolution

If searching by gene, first resolve official identifiers:

```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# For gene-focused searches, resolve the official symbol first.
# This helps construct better search queries.
# Example: "p53" → "TP53" (official HGNC symbol)
```


**Gene Disambiguation Checklist:**
- [ ] Official gene symbol identified (HGNC for human, MGI for mouse)
- [ ] Common aliases noted for search expansion
- [ ] Species confirmed
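
The checklist can be illustrated with a small keyword builder that puts the official symbol first and appends aliases for search expansion. This is a sketch under stated assumptions: `GENE_ALIASES` is a hypothetical hard-coded table used only for illustration (a real workflow would look symbols up in HGNC or MGI), and `build_gene_keywords` is not a ToolUniverse function.

```python
# Hypothetical alias table for illustration only; a real workflow would
# resolve symbols against HGNC (human) or MGI (mouse) instead of hard-coding them.
GENE_ALIASES = {
    ("p53", "Homo sapiens"): {"symbol": "TP53", "aliases": ["p53"]},
    ("p53", "Mus musculus"): {"symbol": "Trp53", "aliases": ["p53"]},
}


def build_gene_keywords(user_term: str, species: str) -> str:
    """Return a keyword string with the official symbol first, then aliases."""
    entry = GENE_ALIASES.get((user_term.lower(), species))
    if entry is None:
        # Unknown term: pass it through unchanged rather than guess a symbol.
        return user_term
    aliases = [alias for alias in entry["aliases"] if alias != entry["symbol"]]
    return " ".join([entry["symbol"]] + aliases)
```

Keying the table on both the term and the species reflects the third checklist item: the same alias can resolve to different official symbols in different organisms.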

1.2 Construct Search Strategy

| User Query Type | Search Strategy |
|---|---|
| Specific accession | Direct retrieval |
| Gene + condition | "[gene] [condition]" + species filter |
| Disease only | "[disease]" + species filter |
| Technology-specific | Add platform keywords (RNA-seq, microarray) |
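
The keyword-based rows of the strategy table could be assembled as follows. This is a minimal sketch: `build_search_kwargs` is a hypothetical helper, and the output keys (`keywords`, `species`, `limit`) simply mirror the parameters used in this document's ArrayExpress search examples.

```python
from typing import Optional


def build_search_kwargs(gene: Optional[str] = None,
                        condition: Optional[str] = None,
                        disease: Optional[str] = None,
                        platform: Optional[str] = None,
                        species: Optional[str] = None,
                        limit: int = 20) -> dict:
    """Assemble keyword-search arguments following the strategy table."""
    # Join whichever terms were supplied, in a fixed order.
    parts = [part for part in (gene, disease, condition, platform) if part]
    if not parts:
        raise ValueError("need at least one search term")
    kwargs = {"keywords": " ".join(parts), "limit": limit}
    if species:
        # The species filter is optional so it can be dropped when broadening a search.
        kwargs["species"] = species
    return kwargs
```

The resulting dictionary could then be unpacked into the search call, for example `tu.tools.arrayexpress_search_experiments(**kwargs)`. Direct accession lookups bypass this step entirely.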

Phase 2: Data Retrieval (Internal)

Search silently. Do NOT narrate the process.

2.1 Search Experiments

```python
# ArrayExpress search
result = tu.tools.arrayexpress_search_experiments(
    keywords="[gene/disease] [condition]",
    species="[species]",
    limit=20
)

# BioStudies for multi-omics
biostudies_result = tu.tools.biostudies_search_studies(
    query="[keywords]",
    limit=10
)
```

2.2 Get Experiment Details

For top results, retrieve full metadata:

```python
# Get details for each relevant experiment
details = tu.tools.arrayexpress_get_experiment_details(
    accession=accession
)

# Get sample information
samples = tu.tools.arrayexpress_get_experiment_samples(
    accession=accession
)

# Get available files
files = tu.tools.arrayexpress_get_experiment_files(
    accession=accession
)
```

2.3 BioStudies Retrieval

```python
# Multi-omics study details
study_details = tu.tools.biostudies_get_study_details(
    accession=study_accession
)

# Study structure
sections = tu.tools.biostudies_get_study_sections(
    accession=study_accession
)

# Available files
files = tu.tools.biostudies_get_study_files(
    accession=study_accession
)
```

Fallback Chains

| Primary | Fallback | Notes |
|---|---|---|
| ArrayExpress search | BioStudies search | ArrayExpress empty |
| `arrayexpress_get_experiment_details` | `biostudies_get_study_details` | E-GEOD may have BioStudies mirror |
| `arrayexpress_get_experiment_files` | Note "Files unavailable" | Some studies restrict downloads |
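
A fallback chain like the one in the table can be expressed as an ordered list of callables, where the first non-empty result wins. This is a sketch, not the skill's actual mechanism: `run_with_fallback` is a hypothetical helper, and treating any exception as an empty result is an assumption made for brevity.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def run_with_fallback(
    steps: Sequence[Tuple[str, Callable[[], Any]]]
) -> Tuple[Optional[str], Any]:
    """Try each (label, call) pair in order; return the first non-empty result.

    A step counts as failed if it raises or returns an empty value.
    """
    for label, call in steps:
        try:
            result = call()
        except Exception:
            continue  # assumption: errors are treated like an empty result
        if result:
            return label, result
    return None, None
```

Each step would wrap one tool call in a zero-argument function, for example `("ArrayExpress", lambda: tu.tools.arrayexpress_search_experiments(keywords=kw, species=sp, limit=20))` followed by a BioStudies search step. The returned label records which source actually supplied the data, which is useful when stating the databases searched in the report.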

Phase 3: Report Dataset Profile

Output Structure

Present as a Dataset Search Report. Hide the search process.

Expression Data: [Query Topic]

Search Summary
  • Query: [gene/disease] in [species]
  • Databases: ArrayExpress, BioStudies
  • Results: [N] relevant experiments found
Data Quality Overview: [assessment based on criteria below]

Top Experiments

1. [E-MTAB-XXXX]: [Title]

| Attribute | Value |
|---|---|
| Accession | [accession with link] |
| Organism | [species] |
| Experiment Type | RNA-seq / Microarray |
| Platform | [specific platform] |
| Samples | [N] samples |
| Release Date | [date] |

Description: [Brief description from metadata]

Experimental Design:
  • Conditions: [treatment vs control, etc.]
  • Replicates: [N biological, M technical]
  • Tissue/Cell type: [if specified]

Sample Groups:

| Group | Samples | Description |
|---|---|---|
| Control | [N] | [description] |
| Treatment | [N] | [description] |

Data Files Available:

| File | Type | Size |
|---|---|---|
| [filename] | Processed data | [size] |
| [filename] | Raw data | [size] |
| [filename] | Sample metadata | [size] |

Quality Assessment: ●●● High / ●●○ Medium / ●○○ Low
  • Sample size: [adequate/limited]
  • Replication: [yes/no]
  • Metadata completeness: [complete/partial]

2. [E-GEOD-XXXXX]: [Title]

[Same structure as above]

Multi-Omics Studies (from BioStudies)

[S-BSST-XXXXX]: [Title]

| Attribute | Value |
|---|---|
| Accession | [accession] |
| Study Type | [proteomics/metabolomics/integrated] |
| Organism | [species] |
| Samples | [N] |

Data Types Included:
  • Transcriptomics
  • Proteomics
  • Metabolomics
  • Other: [specify]

Summary Table

| Accession | Type | Samples | Platform | Quality |
|---|---|---|---|---|
| [E-MTAB-X] | RNA-seq | [N] | Illumina | ●●● |
| [E-GEOD-X] | Microarray | [N] | Affymetrix | ●●○ |

Recommendations

For [specific analysis type]:
  • Best experiment: [accession] - [reason]
  • Alternative: [accession] - [reason]
Data Integration Notes:
  • Platform compatibility: [notes on combining datasets]
  • Batch considerations: [if applicable]

Data Access

Direct Download Links

  • E-MTAB-XXXX processed data
  • E-MTAB-XXXX raw data

Database Links

Data Quality Tiers

Assessment criteria for expression experiments:

| Tier | Symbol | Criteria |
|---|---|---|
| High Quality | ●●● | ≥3 biological replicates, complete metadata, processed data available |
| Medium Quality | ●●○ | 2-3 replicates OR some metadata gaps, data accessible |
| Low Quality | ●○○ | No replicates, sparse metadata, or data access issues |
| Use with Caution | ○○○ | Single sample, no replication, outdated platform |

Include assessment rationale:

```markdown
**Quality**: ●●● High
- ✓ 4 biological replicates per condition
- ✓ Complete sample annotations
- ✓ Processed and raw data available
- ✓ Recent RNA-seq platform
```
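
One way to apply the tier criteria consistently is a small decision function. The criteria above leave some room for interpretation, so this is only one possible reading: `assign_quality_tier` and its parameters are hypothetical, platform age is not modeled, and the order of checks (worst tier first) is an assumption.

```python
def assign_quality_tier(bio_replicates: int,
                        metadata: str,
                        data_accessible: bool,
                        processed_data: bool,
                        total_samples: int) -> str:
    """Map experiment properties to a tier label.

    metadata is one of "complete", "partial", or "sparse".
    Checks run from the worst tier upward so the strictest caveat wins.
    """
    if total_samples <= 1:
        return "○○○ Use with Caution"
    if bio_replicates < 2 or metadata == "sparse" or not data_accessible:
        return "●○○ Low"
    if bio_replicates >= 3 and metadata == "complete" and processed_data:
        return "●●● High"
    # Everything else: 2-3 replicates or some metadata gaps, data accessible.
    return "●●○ Medium"
```

Whatever rule is used, the report should still list the individual observations behind the tier, as in the rationale example above.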

Completeness Checklist

Every dataset report MUST include:

Per Experiment (Required)

  • Accession number with database link
  • Organism
  • Experiment type (RNA-seq/microarray/etc.)
  • Sample count
  • Brief description
  • Quality assessment

Search Summary (Required)

  • Query parameters stated
  • Number of results
  • Databases searched

Recommendations (Required)

  • Best dataset for user's purpose (or "No suitable data found")
  • Data access notes

Include Even If Empty

  • Multi-omics studies section (or "No multi-omics studies found")
  • Data integration notes (or "Single-platform data, no integration needed")
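
The checklist lends itself to an automated check before a report is returned. The sketch below assumes the report has first been collected into a plain dictionary; that dictionary layout, the field names, and `missing_report_items` are all hypothetical and chosen only to mirror the checklist items.

```python
# Hypothetical report layout mirroring the checklist; not a ToolUniverse structure.
REQUIRED_EXPERIMENT_FIELDS = (
    "accession", "link", "organism", "experiment_type",
    "sample_count", "description", "quality",
)
REQUIRED_SECTIONS = {
    "search_summary": ("query", "result_count", "databases"),
    "recommendations": ("best_dataset", "access_notes"),
}
# Sections that must be present even when empty, with an explicit "none found" note.
ALWAYS_PRESENT = ("multi_omics", "integration_notes")


def missing_report_items(report: dict) -> list:
    """Return human-readable descriptions of checklist items that are absent."""
    missing = []
    for index, experiment in enumerate(report.get("experiments", []), start=1):
        for field in REQUIRED_EXPERIMENT_FIELDS:
            if experiment.get(field) in (None, ""):
                missing.append(f"experiment {index}: {field}")
    for section, fields in REQUIRED_SECTIONS.items():
        block = report.get(section) or {}
        for field in fields:
            if block.get(field) in (None, "", []):
                missing.append(f"{section}: {field}")
    for key in ALWAYS_PRESENT:
        if not report.get(key):
            missing.append(f"{key}: state explicitly that none was found")
    return missing
```

An empty return value means every required item is present; a count of zero results or zero samples is treated as a valid value rather than a gap.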

Common Use Cases

Disease Gene Expression

User: "Find breast cancer RNA-seq data"

```python
result = tu.tools.arrayexpress_search_experiments(
    keywords="breast cancer RNA-seq",
    species="Homo sapiens",
    limit=20
)
```

→ Report top experiments with quality assessment

Gene-Specific Studies

User: "Find TP53 expression experiments in mouse"

```python
result = tu.tools.arrayexpress_search_experiments(
    keywords="Trp53 TP53 p53",  # Official mouse symbol (MGI: Trp53) plus aliases
    species="Mus musculus",
    limit=15
)
```

→ Report experiments studying this gene

Specific Accession Lookup

User: "Get details for E-MTAB-5214" → Single experiment profile with all details and files

Multi-Omics Integration

User: "Find proteomics and transcriptomics studies for liver disease" → Search both ArrayExpress and BioStudies, note integration potential

Error Handling

| Error | Response |
|---|---|
| "No experiments found" | Broaden keywords, remove species filter, try synonyms |
| "Accession not found" | Verify format (`E-MTAB-*`, `E-GEOD-*`, `S-BSST*`), check if withdrawn |
| "Files not available" | Note in report: "Data files restricted by submitter" |
| "API timeout" | Retry once, then note: "(metadata retrieval incomplete)" |

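The "retry once, then note" rule for timeouts can be sketched as a thin wrapper around any tool call. This is illustrative only: `call_with_single_retry` is a hypothetical helper, and it assumes a timeout surfaces as Python's built-in `TimeoutError`, which may not match how the underlying client reports it.

```python
from typing import Any, Callable, Tuple

INCOMPLETE_NOTE = "(metadata retrieval incomplete)"


def call_with_single_retry(call: Callable[[], Any]) -> Tuple[Any, str]:
    """Run `call`; on a timeout retry exactly once, then give up with a report note.

    Returns (result, note). The note is empty when the call succeeded.
    """
    for _attempt in range(2):  # first try plus one retry
        try:
            return call(), ""
        except TimeoutError:  # assumption: timeouts are raised as TimeoutError
            continue
    return None, INCOMPLETE_NOTE
```

The returned note can be appended to the relevant report section so the reader knows the metadata may be partial.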

Tool Reference

ArrayExpress (Gene Expression)

| Tool | Purpose |
|---|---|
| `arrayexpress_search_experiments` | Keyword/species search |
| `arrayexpress_get_experiment_details` | Full metadata |
| `arrayexpress_get_experiment_files` | Download links |
| `arrayexpress_get_experiment_samples` | Sample annotations |

BioStudies (Multi-Omics)

| Tool | Purpose |
|---|---|
| `biostudies_search_studies` | Multi-omics search |
| `biostudies_get_study_details` | Study metadata |
| `biostudies_get_study_files` | Data files |
| `biostudies_get_study_sections` | Study structure |

Search Parameters Reference

ArrayExpress

| Parameter | Description | Example |
|---|---|---|
| `keywords` | Free text search | "breast cancer RNA-seq" |
| `species` | Scientific name | "Homo sapiens" |
| `array` | Platform filter | "Illumina" |
| `limit` | Max results | 20 |

BioStudies

| Parameter | Description | Example |
|---|---|---|
| `query` | Free text | "proteomics liver" |
| `limit` | Max results | 10 |