tooluniverse-metagenomics-analysis
Metagenomics & Microbiome Analysis
Integrated pipeline for exploring microbiome studies, classifying taxa, assessing genome quality, linking microbial composition to clinical phenotypes, and interpreting findings through pathway analysis and literature context.
Guiding principles:
- Study context first -- understand biome, sequencing method, and metadata before diving into taxa
- Taxonomic consistency -- GTDB taxonomy as reference standard; reconcile NCBI where needed
- Genome quality matters -- CheckM completeness/contamination thresholds determine trustworthy MAGs
- Interpretation over enumeration -- explain what taxa mean for the biological question
- English-first queries -- use English terms in tool calls
LOOK UP, DON'T GUESS
When uncertain about any scientific fact, SEARCH databases first rather than reasoning from memory.
COMPUTE, DON'T DESCRIBE
When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.
Core Databases
| Database | Best For |
|---|---|
| MGnify | Processed metagenomics studies, taxonomic/functional results |
| GTDB | Standardized bacterial/archaeal taxonomy, species-level resolution |
| GMrepo | Gut species-to-human-health phenotype associations |
| ENA | Raw sequencing datasets and study metadata |
| KEGG | Pathway mapping for microbial functional annotations |
| PubMed/EuropePMC | Published microbiome-disease studies |
| CTD | Chemical-microbiome-disease relationships |
Workflow
Phase 0: Parse query → organism, biome, phenotype, or accession
Phase 1: Study Discovery → MGnify_search_studies, ENAPortal_search_studies
Phase 2: Taxonomic Classification → GTDB_search_genomes, GTDB_get_species, GTDB_search_taxon
Phase 3: Genome Quality → MGnify_search_genomes, MGnify_get_genome (CheckM metrics)
Phase 4: Functional Annotation → MGnify GO terms + KEGG pathway mapping
Phase 5: Clinical Associations → GMrepo species-phenotype links
Phase 6: Literature → PubMed/EuropePMC + CTD gene-disease
Phase 7: Interpretation & Report Synthesis
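Phase 0 can be sketched as a small classifier that decides which kind of input it has been given. This is only an illustration: the accession patterns are assumed from commonly seen ID formats (MGnify `MGYS`/`MGYG`, ENA `PRJEB`/`ERP` and their NCBI/DDBJ counterparts), and the biome and phenotype vocabularies are hypothetical placeholders.

```python
import re

# Accession patterns assumed from commonly seen ID formats,
# not taken from the tools' own documentation.
ACCESSION_PATTERNS = {
    "mgnify_study": re.compile(r"^MGYS\d+$"),
    "mgnify_genome": re.compile(r"^MGYG\d+$"),
    "ena_project": re.compile(r"^PRJ(EB|NA|DB)\d+$"),
    "ena_study": re.compile(r"^[ESD]RP\d+$"),
}

# Tiny illustrative vocabularies (hypothetical; extend for real use).
BIOME_TERMS = {"gut", "soil", "marine", "oral", "skin"}
PHENOTYPE_TERMS = {"crohn disease", "colorectal neoplasms", "obesity"}


def parse_query(query: str) -> dict:
    """Classify a query as accession, phenotype, biome, or organism."""
    text = query.strip()
    upper = text.upper()
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(upper):
            return {"type": "accession", "subtype": kind, "value": upper}
    lowered = text.lower()
    if lowered in PHENOTYPE_TERMS:
        return {"type": "phenotype", "value": text}
    if lowered in BIOME_TERMS:
        return {"type": "biome", "value": lowered}
    # Anything unrecognized is treated as an organism name.
    return {"type": "organism", "value": text}
```

The result type then decides where to start: an accession can go straight to study or genome lookup, while an organism name would normally start at the taxonomy phase.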
Key Phase Notes
Phase 1: ENA requires structured queries (e.g., `study_title="*IBD*"`), not free text. If ENA fails, fall back to MGnify.
Phase 2: GTDB uses its own naming (e.g., `s__Bacteroides_A fragilis` vs NCBI `Bacteroides fragilis`). Always note discrepancies. Use `GTDB_search_taxon(operation="search_taxon", query=name)`.
Phase 3 - Quality tiers (MIMAG):
- High: >= 90% complete, <= 5% contamination, 5S/16S/23S rRNA genes + >= 18 tRNAs
- Medium: >= 50% complete, <= 10% contamination
- Low: below medium -- flag but don't exclude
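The tiers above translate directly into a small classifier. The sketch below assumes completeness and contamination are CheckM percentages, and that rRNA presence and tRNA count are available as separate fields; the function and parameter names are illustrative, not part of any tool's output schema.

```python
def mimag_tier(completeness: float, contamination: float,
               has_rrna: bool = False, trna_count: int = 0) -> str:
    """Return 'high', 'medium', or 'low' using the thresholds listed above."""
    if (completeness >= 90 and contamination <= 5
            and has_rrna and trna_count >= 18):
        return "high"
    if completeness >= 50 and contamination <= 10:
        return "medium"
    # Below medium: flag in the report, but keep the genome.
    return "low"


# Made-up metrics, for illustration only.
examples = {
    "genome_a": mimag_tier(97.3, 1.2, has_rrna=True, trna_count=20),
    "genome_b": mimag_tier(97.3, 1.2),  # no rRNA/tRNA evidence supplied
    "genome_c": mimag_tier(42.0, 3.0),
}
```

A genome with excellent CheckM numbers but no rRNA or tRNA evidence lands in the medium tier here, which is why `genome_b` is not rated high.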
Phase 4 - Functional interpretation: Don't just list GO terms. Connect to biology:
| Functional Category | Key KEGG Pathways | Significance |
|---|---|---|
| SCFA production | map00650, map00640 | Gut barrier, anti-inflammatory |
| LPS biosynthesis | map00540 | Pro-inflammatory, endotoxemia |
| Bile acid metabolism | map00120 | Fat absorption, FXR signaling |
| Tryptophan metabolism | map00380 | Serotonin, AhR, immune |
| Vitamin biosynthesis | map00730, map00740, map00760 | Host nutritional contribution |
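The table lists reference-map IDs (`map` prefix). In KEGG the same five-digit number is shared by the `ko` version of a pathway and by organism-specific versions such as `eco`, so moving between them is a prefix swap. A minimal sketch, with a hypothetical helper name:

```python
import re

# Prefix (2-4 lowercase letters) followed by a five-digit pathway number.
_PATHWAY_ID = re.compile(r"^([a-z]{2,4})(\d{5})$")


def with_prefix(pathway_id: str, prefix: str) -> str:
    """Swap the prefix of a KEGG pathway ID, keeping the five-digit number."""
    match = _PATHWAY_ID.match(pathway_id.strip().lower())
    if match is None:
        raise ValueError(f"not a KEGG pathway ID: {pathway_id!r}")
    return prefix.lower() + match.group(2)


# Butanoate metabolism as a KO pathway and as the E. coli K-12 pathway.
ko_id = with_prefix("map00650", "ko")
eco_id = with_prefix("map00650", "eco")
```

The helper only rewrites the string. It does not check that the pathway exists for the chosen organism, so a lookup is still needed to confirm the result.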
Use `kegg_search_pathway(keyword=...)` (NOT `query`). Pathway IDs need organism prefix (`hsa`, `ko`, `eco`), NOT bare `map`.
Phase 5: GMrepo uses MeSH terms: "Crohn Disease" not "IBD", "Colitis, Ulcerative" not "UC", "Colorectal Neoplasms" not "colorectal cancer". Try NCBI taxon IDs if species name fails.
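One way to apply the MeSH rule is a small lookup that turns colloquial disease names into candidate headings before querying. The table below is a sketch built from the examples above. Expanding "IBD" to both subtype headings, and the two spelled-out synonyms, are assumptions added for illustration.

```python
# Colloquial term -> candidate MeSH headings to try, in order.
MESH_CANDIDATES = {
    "ibd": ["Crohn Disease", "Colitis, Ulcerative"],  # assumption: try both subtypes
    "crohn's disease": ["Crohn Disease"],             # assumed synonym
    "uc": ["Colitis, Ulcerative"],
    "ulcerative colitis": ["Colitis, Ulcerative"],    # assumed synonym
    "colorectal cancer": ["Colorectal Neoplasms"],
}


def mesh_candidates(term: str) -> list[str]:
    """Return MeSH headings for a known colloquial term, else the term itself."""
    cleaned = term.strip()
    return list(MESH_CANDIDATES.get(cleaned.lower(), [cleaned]))
```

Unknown terms pass through unchanged, so a term that is already a valid MeSH heading is queried as given.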
Phase 6 - Evidence grading:
- Strong: Meta-analysis or >5 studies, consistent direction
- Moderate: 2-5 studies consistent, or 1 large cohort
- Preliminary: Single study or conflicting
- Mechanistic only: In vitro/animal, no human epidemiology
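The grading rubric can be written as a function so that grades are assigned consistently across taxa. This is one possible reading of the rules above: it assumes a meta-analysis counts as strong on its own, and it adds a "No evidence" outcome for the case the rubric does not cover. The parameter names are illustrative.

```python
def grade_evidence(n_human_studies: int, consistent: bool,
                   has_meta_analysis: bool = False,
                   has_large_cohort: bool = False,
                   has_mechanistic: bool = False) -> str:
    """Map study counts and flags to an evidence grade."""
    if n_human_studies == 0 and not has_meta_analysis:
        # No human epidemiology: only in vitro/animal data, if anything.
        return "Mechanistic only" if has_mechanistic else "No evidence"
    if has_meta_analysis or (n_human_studies > 5 and consistent):
        return "Strong"
    if (n_human_studies >= 2 and consistent) or has_large_cohort:
        return "Moderate"
    # Single study, or several studies pointing in different directions.
    return "Preliminary"
```

For example, three consistent studies grade as "Moderate", while six conflicting studies with no large cohort grade as "Preliminary".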
Phase 7 - Report: Executive summary, study landscape, GTDB taxonomy, functional interpretation (not GO term lists), clinical relevance with evidence grades, mechanistic model, genome catalog with quality tiers, data gaps.
Edge Cases & Fallbacks
- Taxon not in GTDB: Try partial search or fall back to MGnify (NCBI taxonomy)
- No GMrepo data: Normal for non-gut organisms; use literature
- GMrepo 0 results: Use formal MeSH terms or NCBI taxon IDs
- No KEGG match: Check MetaCyc or literature
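These fallbacks share one pattern: try the preferred source, and move to the next only when it returns nothing or fails. A generic sketch of that pattern follows. The fetch functions are hypothetical stand-ins for real tool calls.

```python
from typing import Callable, Iterable, Optional


def first_hit(
    sources: Iterable[tuple[str, Callable[[], list]]],
) -> tuple[Optional[str], list]:
    """Try each (label, fetch) pair in order; return the first non-empty result."""
    for label, fetch in sources:
        try:
            results = fetch()
        except Exception:
            # A failing source is treated like an empty one in this sketch.
            continue
        if results:
            return label, results
    return None, []


# Hypothetical stand-ins for an exact GTDB search, a partial GTDB search,
# and an MGnify lookup that uses NCBI taxonomy.
def gtdb_exact() -> list:
    return []


def gtdb_partial() -> list:
    return []


def mgnify_ncbi() -> list:
    return [{"source": "MGnify", "taxonomy": "NCBI"}]


label, hits = first_hit([
    ("GTDB exact", gtdb_exact),
    ("GTDB partial", gtdb_partial),
    ("MGnify (NCBI taxonomy)", mgnify_ncbi),
])
```

Returning the label along with the results lets the report state which source answered, which matters when the fallback uses a different taxonomy than the primary source.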
Limitations
- GMrepo: Gut-only
- GTDB: Bacteria/Archaea only
- ENA: Raw data only, strict query syntax
- No sequence analysis: Queries databases, not raw FASTQ/FASTA