tooluniverse-expression-data-retrieval
Gene Expression & Omics Data Retrieval
Retrieve gene expression experiments and multi-omics datasets with proper disambiguation and quality assessment.
IMPORTANT: Always use English terms in tool calls (gene names, tissue names, condition descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.
Workflow Overview
Phase 0: Clarify Query (if ambiguous)
↓
Phase 1: Disambiguate Gene/Condition
↓
Phase 2: Search & Retrieve (Internal)
↓
Phase 3: Report Dataset Profile

Phase 0: Clarification (When Needed)
Ask the user ONLY if:
- Gene name is ambiguous (e.g., "p53" → TP53 or MDM2 studies?)
- Tissue/condition unclear for comparative studies
- Organism not specified for non-human research
Skip clarification for:
- Specific accession numbers (`E-MTAB-*`, `E-GEOD-*`, `S-BSST*`)
- Clear disease/tissue + organism combinations
- Explicit platform requests (RNA-seq, microarray)
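
The accession rule above can be checked mechanically before deciding whether to ask the user anything. The following is a minimal sketch, assuming accessions are the listed prefixes followed by digits; the regular expression and the helper name `looks_like_accession` are illustrative assumptions and not part of ToolUniverse.

```python
import re

# Assumed pattern: the prefixes named above followed by digits.
# Real accession formats may be broader than this.
ACCESSION_RE = re.compile(r"\b(E-MTAB-\d+|E-GEOD-\d+|S-BSST\d+)\b", re.IGNORECASE)


def looks_like_accession(query: str) -> bool:
    """Return True if the query contains something shaped like an accession."""
    return ACCESSION_RE.search(query) is not None
```

A query such as "Get details for E-MTAB-5214" would skip clarification, while "Find p53 expression data" would still go through the ambiguity checks.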
Phase 1: Query Disambiguation
1.1 Gene Name Resolution
If searching by gene, first resolve official identifiers:
```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# For gene-focused searches, resolve the official symbol first.
# This helps construct better search queries.
# Example: "p53" → "TP53" (official HGNC symbol)
```
**Gene Disambiguation Checklist:**
- [ ] Official gene symbol identified (HGNC for human, MGI for mouse)
- [ ] Common aliases noted for search expansion
- [ ] Species confirmed
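
The checklist can be turned into a small keyword builder that puts the official symbol first and appends aliases for search expansion. This is a sketch under stated assumptions: the `ALIASES` table and `build_gene_keywords` are hypothetical, and a real workflow would look symbols up in HGNC or MGI rather than hard-code them.

```python
# Hypothetical alias table for illustration only; a real workflow would
# query HGNC (human) or MGI (mouse) instead of hard-coding entries.
ALIASES = {
    ("p53", "Homo sapiens"): ("TP53", ["p53"]),
    ("p53", "Mus musculus"): ("Trp53", ["p53"]),
}


def build_gene_keywords(user_term: str, species: str) -> str:
    """Return 'OFFICIAL alias1 alias2 ...' for search expansion.

    Falls back to the user's term unchanged when it is not in the table.
    """
    entry = ALIASES.get((user_term.lower(), species))
    if entry is None:
        return user_term
    official, aliases = entry
    return " ".join([official, *aliases])
```

Keying the table on both term and species reflects the checklist's last item: the same alias resolves to different official symbols in human and mouse.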
1.2 Construct Search Strategy
| User Query Type | Search Strategy |
|---|---|
| Specific accession | Direct retrieval |
| Gene + condition | "[gene] [condition]" + species filter |
| Disease only | "[disease]" + species filter |
| Technology-specific | Add platform keywords (RNA-seq, microarray) |
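
One way to apply the table is to assemble the search arguments in a single helper. The sketch below assumes the `keywords`, `species`, and `limit` parameters used in the search calls of this document; the helper `build_search_kwargs` itself is hypothetical, and the direct-retrieval row (specific accession) is deliberately left out because it does not go through a keyword search.

```python
from typing import Optional


def build_search_kwargs(
    gene: Optional[str] = None,
    condition: Optional[str] = None,
    disease: Optional[str] = None,
    platform: Optional[str] = None,
    species: Optional[str] = None,
    limit: int = 20,
) -> dict:
    """Assemble keyword arguments following the strategy table."""
    # Join whichever terms were supplied into one free-text keyword string.
    terms = [t for t in (gene, disease, condition, platform) if t]
    if not terms:
        raise ValueError("At least one search term is required")
    kwargs = {"keywords": " ".join(terms), "limit": limit}
    # Only add the species filter when the organism is known.
    if species:
        kwargs["species"] = species
    return kwargs
```

The resulting dictionary could then be unpacked into a search call, for example `tu.tools.arrayexpress_search_experiments(**build_search_kwargs(...))`.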
Phase 2: Data Retrieval (Internal)
Search silently. Do NOT narrate the process.
2.1 Search Experiments
```python
# ArrayExpress search
result = tu.tools.arrayexpress_search_experiments(
    keywords="[gene/disease] [condition]",
    species="[species]",
    limit=20
)

# BioStudies for multi-omics
biostudies_result = tu.tools.biostudies_search_studies(
    query="[keywords]",
    limit=10
)
```
2.2 Get Experiment Details
For top results, retrieve full metadata:
```python
# Get details for each relevant experiment
details = tu.tools.arrayexpress_get_experiment_details(
    accession=accession
)

# Get sample information
samples = tu.tools.arrayexpress_get_experiment_samples(
    accession=accession
)

# Get available files
files = tu.tools.arrayexpress_get_experiment_files(
    accession=accession
)
```
2.3 BioStudies Retrieval
```python
# Multi-omics study details
study_details = tu.tools.biostudies_get_study_details(
    accession=study_accession
)

# Study structure
sections = tu.tools.biostudies_get_study_sections(
    accession=study_accession
)

# Available files
files = tu.tools.biostudies_get_study_files(
    accession=study_accession
)
```
Fallback Chains
| Primary | Fallback | Notes |
|---|---|---|
| ArrayExpress search | BioStudies search | ArrayExpress empty |
| arrayexpress_get_experiment_details | biostudies_get_study_details | E-GEOD may have BioStudies mirror |
| arrayexpress_get_experiment_files | Note "Files unavailable" | Some studies restrict downloads |
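
The chains above share one pattern: try the primary call, and move to the fallback when it fails or returns nothing. A minimal sketch of that pattern follows. The helper `first_non_empty` is hypothetical, and it assumes an empty result is falsy (for example an empty list), which may not match what the tools actually return.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def first_non_empty(
    steps: Sequence[Tuple[str, Callable[[], Any]]],
) -> Tuple[Optional[str], Any]:
    """Run each (label, call) in order; return the first non-empty result."""
    for label, call in steps:
        try:
            result = call()
        except Exception:
            # Treat a failed call like an empty result and try the next step.
            continue
        if result:
            return label, result
    # Every step failed or came back empty.
    return None, None


# Usage with the Phase 2 tools (requires `tu` from section 1.1):
# source, hits = first_non_empty([
#     ("ArrayExpress", lambda: tu.tools.arrayexpress_search_experiments(
#         keywords="TP53", species="Homo sapiens", limit=20)),
#     ("BioStudies", lambda: tu.tools.biostudies_search_studies(
#         query="TP53", limit=10)),
# ])
```

Returning the label alongside the result lets the report state which database the data came from.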
Phase 3: Report Dataset Profile
Output Structure
Present as a Dataset Search Report. Hide search process.
```markdown
Expression Data: [Query Topic]

Search Summary
- Query: [gene/disease] in [species]
- Databases: ArrayExpress, BioStudies
- Results: [N] relevant experiments found

Data Quality Overview: [assessment based on criteria below]

Top Experiments

1. [E-MTAB-XXXX]: [Title]

| Attribute | Value |
|---|---|
| Accession | [accession with link] |
| Organism | [species] |
| Experiment Type | RNA-seq / Microarray |
| Platform | [specific platform] |
| Samples | [N] samples |
| Release Date | [date] |

Description: [Brief description from metadata]

Experimental Design:
- Conditions: [treatment vs control, etc.]
- Replicates: [N biological, M technical]
- Tissue/Cell type: [if specified]

Sample Groups:
| Group | Samples | Description |
|---|---|---|
| Control | [N] | [description] |
| Treatment | [N] | [description] |

Data Files Available:
| File | Type | Size |
|---|---|---|
| [filename] | Processed data | [size] |
| [filename] | Raw data | [size] |
| [filename] | Sample metadata | [size] |

Quality Assessment: ●●● High / ●●○ Medium / ●○○ Low
- Sample size: [adequate/limited]
- Replication: [yes/no]
- Metadata completeness: [complete/partial]

2. [E-GEOD-XXXXX]: [Title]

[Same structure as above]

Multi-Omics Studies (from BioStudies)

[S-BSST-XXXXX]: [Title]

| Attribute | Value |
|---|---|
| Accession | [accession] |
| Study Type | [proteomics/metabolomics/integrated] |
| Organism | [species] |
| Samples | [N] |

Data Types Included:
- Transcriptomics
- Proteomics
- Metabolomics
- Other: [specify]

Summary Table

| Accession | Type | Samples | Platform | Quality |
|---|---|---|---|---|
| [E-MTAB-X] | RNA-seq | [N] | Illumina | ●●● |
| [E-GEOD-X] | Microarray | [N] | Affymetrix | ●●○ |

Recommendations

For [specific analysis type]:
- Best experiment: [accession] - [reason]
- Alternative: [accession] - [reason]

Data Integration Notes:
- Platform compatibility: [notes on combining datasets]
- Batch considerations: [if applicable]

Data Access

Direct Download Links
- E-MTAB-XXXX processed data
- E-MTAB-XXXX raw data

Database Links
- ArrayExpress: https://www.ebi.ac.uk/arrayexpress/experiments/[accession]
- BioStudies: https://www.ebi.ac.uk/biostudies/studies/[accession]

Retrieved: [date]
```

Data Quality Tiers
Assessment criteria for expression experiments:
| Tier | Symbol | Criteria |
|---|---|---|
| High Quality | ●●● | ≥3 bio replicates, complete metadata, processed data available |
| Medium Quality | ●●○ | 2-3 replicates OR some metadata gaps, data accessible |
| Low Quality | ●○○ | No replicates, sparse metadata, or data access issues |
| Use with Caution | ○○○ | Single sample, no replication, outdated platform |
Include assessment rationale:
```markdown
**Quality**: ●●● High
- ✓ 4 biological replicates per condition
- ✓ Complete sample annotations
- ✓ Processed and raw data available
- ✓ Recent RNA-seq platform
```
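
The tier table can be approximated in code so that assessments stay consistent across experiments. The sketch below is one possible reading of the criteria, not a definitive rule set: the function `assign_tier`, its parameters, and the metadata labels `"complete"`, `"partial"`, and `"sparse"` are all hypothetical. It resolves the overlap at three replicates by requiring complete metadata and processed data for the top tier, and it does not model the "outdated platform" criterion.

```python
def assign_tier(
    n_samples: int,
    bio_replicates: int,
    metadata: str,  # assumed labels: "complete", "partial", or "sparse"
    data_accessible: bool,
    processed_available: bool,
) -> str:
    """Map experiment properties to a quality symbol (one reading of the table)."""
    # Use with Caution: a single sample cannot be replicated.
    if n_samples <= 1:
        return "○○○"
    # Low: no replication, sparse metadata, or data access problems.
    if bio_replicates < 2 or metadata == "sparse" or not data_accessible:
        return "●○○"
    # High: at least 3 biological replicates, complete metadata, processed data.
    if bio_replicates >= 3 and metadata == "complete" and processed_available:
        return "●●●"
    # Medium: everything in between (2-3 replicates or some metadata gaps).
    return "●●○"
```

Whatever rule is used, the rationale bullets should still be written out in the report, since the symbol alone does not explain which criterion decided the tier.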
Completeness Checklist
Every dataset report MUST include:
Per Experiment (Required)
- Accession number with database link
- Organism
- Experiment type (RNA-seq/microarray/etc.)
- Sample count
- Brief description
- Quality assessment
Search Summary (Required)
- Query parameters stated
- Number of results
- Databases searched
Recommendations (Required)
- Best dataset for user's purpose (or "No suitable data found")
- Data access notes
Include Even If Empty
- Multi-omics studies section (or "No multi-omics studies found")
- Data integration notes (or "Single-platform data, no integration needed")
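
The checklist lends itself to an automated check before a report is returned. This is a sketch that assumes the report is first assembled as a plain dictionary; the field names and the `missing_fields` helper are hypothetical and would need to match whatever structure is actually used.

```python
# Hypothetical field names mirroring the checklist items above.
REQUIRED_SUMMARY_FIELDS = ("query", "result_count", "databases")
REQUIRED_EXPERIMENT_FIELDS = (
    "accession", "link", "organism", "experiment_type",
    "sample_count", "description", "quality",
)
REQUIRED_TOP_LEVEL = (
    "recommendation", "data_access_notes", "multi_omics", "integration_notes",
)


def missing_fields(report: dict) -> list:
    """List required items that are absent (None or not present) in a report."""
    missing = []
    summary = report.get("summary", {})
    for key in REQUIRED_SUMMARY_FIELDS:
        # Compare against None so that a legitimate count of 0 is accepted.
        if summary.get(key) is None:
            missing.append(f"summary.{key}")
    for i, exp in enumerate(report.get("experiments", [])):
        for key in REQUIRED_EXPERIMENT_FIELDS:
            if exp.get(key) is None:
                missing.append(f"experiments[{i}].{key}")
    for key in REQUIRED_TOP_LEVEL:
        if report.get(key) is None:
            missing.append(key)
    return missing
```

Sections that must appear even when empty pass this check as long as they carry their placeholder text, such as "No multi-omics studies found".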
Common Use Cases
Disease Gene Expression
User: "Find breast cancer RNA-seq data"
```python
result = tu.tools.arrayexpress_search_experiments(
    keywords="breast cancer RNA-seq",
    species="Homo sapiens",
    limit=20
)
```
→ Report top experiments with quality assessment
Gene-Specific Studies
User: "Find TP53 expression experiments in mouse"
```python
result = tu.tools.arrayexpress_search_experiments(
    keywords="Trp53 TP53 p53",  # Official mouse symbol (MGI) plus aliases
    species="Mus musculus",
    limit=15
)
```
→ Report experiments studying this gene
Specific Accession Lookup
User: "Get details for E-MTAB-5214"
→ Single experiment profile with all details and files
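
For an accession lookup, the three retrieval calls from section 2.2 can be bundled into one profile. The sketch below uses only tool names that appear in this document; the wrapper `profile_experiment` is hypothetical, and the shape of each returned value is not specified here, so the results are stored as returned.

```python
def profile_experiment(tu, accession: str) -> dict:
    """Collect details, samples, and files for one ArrayExpress accession.

    `tu` is a loaded ToolUniverse instance, as created in section 1.1.
    """
    return {
        "accession": accession,
        "details": tu.tools.arrayexpress_get_experiment_details(accession=accession),
        "samples": tu.tools.arrayexpress_get_experiment_samples(accession=accession),
        "files": tu.tools.arrayexpress_get_experiment_files(accession=accession),
    }


# Usage (requires a loaded ToolUniverse instance):
# profile = profile_experiment(tu, "E-MTAB-5214")
```

Passing `tu` in as an argument keeps the helper easy to exercise with a stand-in object when the real tools are unavailable.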
Multi-Omics Integration
User: "Find proteomics and transcriptomics studies for liver disease"
→ Search both ArrayExpress and BioStudies, note integration potential
Error Handling
| Error | Response |
|---|---|
| "No experiments found" | Broaden keywords, remove species filter, try synonyms |
| "Accession not found" | Verify format (`E-MTAB-*`, `E-GEOD-*`, `S-BSST*`), check if withdrawn |
| "Files not available" | Note in report: "Data files restricted by submitter" |
| "API timeout" | Retry once, then note: "(metadata retrieval incomplete)" |
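
The timeout rule in the last row (retry once, then annotate the report) can be sketched as a small wrapper. It assumes timeouts surface as Python's built-in `TimeoutError`, which may not be how ToolUniverse signals them; the helper `call_with_one_retry` is hypothetical.

```python
from typing import Any, Callable, Tuple

INCOMPLETE_NOTE = "(metadata retrieval incomplete)"


def call_with_one_retry(call: Callable[[], Any]) -> Tuple[Any, str]:
    """Try a call, retry once on TimeoutError, then give up with a note."""
    for _attempt in range(2):  # first try plus exactly one retry
        try:
            return call(), ""
        except TimeoutError:
            continue
    # Both attempts timed out: return no data plus the note for the report.
    return None, INCOMPLETE_NOTE
```

The second element of the returned pair is empty on success, so it can be appended to the report unconditionally.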
Tool Reference
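ArrayExpress (Gene Expression)
| Tool | Purpose |
|---|---|
| arrayexpress_search_experiments | Keyword/species search |
| arrayexpress_get_experiment_details | Full metadata |
| arrayexpress_get_experiment_files | Download links |
| arrayexpress_get_experiment_samples | Sample annotations |
BioStudies (Multi-Omics)
| Tool | Purpose |
|---|---|
| biostudies_search_studies | Multi-omics search |
| biostudies_get_study_details | Study metadata |
| biostudies_get_study_files | Data files |
| biostudies_get_study_sections | Study structure |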
Search Parameters Reference
ArrayExpress
| Parameter | Description | Example |
|---|---|---|
| keywords | Free text search | "breast cancer RNA-seq" |
| species | Scientific name | "Homo sapiens" |
| | Platform filter | "Illumina" |
| limit | Max results | 20 |
BioStudies
| Parameter | Description | Example |
|---|---|---|
| query | Free text | "proteomics liver" |
| limit | Max results | 10 |