tooluniverse-metagenomics-analysis

Metagenomics & Microbiome Analysis

Integrated pipeline for exploring microbiome studies, classifying taxa, assessing genome quality, linking microbial composition to clinical phenotypes, and interpreting findings through pathway analysis and literature context.
Guiding principles:
  1. Study context first -- understand biome, sequencing method, and metadata before diving into taxa
  2. Taxonomic consistency -- GTDB taxonomy as reference standard; reconcile NCBI where needed
  3. Genome quality matters -- CheckM completeness/contamination thresholds determine trustworthy MAGs
  4. Interpretation over enumeration -- explain what taxa mean for the biological question
  5. English-first queries -- use English terms in tool calls

LOOK UP, DON'T GUESS

When uncertain about any scientific fact, SEARCH databases first rather than reasoning from memory.

COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do; execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

Core Databases

Database | Best For
MGnify | Processed metagenomics studies, taxonomic/functional results
GTDB | Standardized bacterial/archaeal taxonomy, species-level resolution
GMrepo | Gut species-to-human-health phenotype associations
ENA | Raw sequencing datasets and study metadata
KEGG | Pathway mapping for microbial functional annotations
PubMed/EuropePMC | Published microbiome-disease studies
CTD | Chemical-microbiome-disease relationships

Workflow

Phase 0: Parse query → organism, biome, phenotype, or accession
Phase 1: Study Discovery → MGnify_search_studies, ENAPortal_search_studies
Phase 2: Taxonomic Classification → GTDB_search_genomes, GTDB_get_species, GTDB_search_taxon
Phase 3: Genome Quality → MGnify_search_genomes, MGnify_get_genome (CheckM metrics)
Phase 4: Functional Annotation → MGnify GO terms + KEGG pathway mapping
Phase 5: Clinical Associations → GMrepo species-phenotype links
Phase 6: Literature → PubMed/EuropePMC + CTD gene-disease
Phase 7: Interpretation & Report Synthesis
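
Phase 0 can be partly mechanised. The sketch below only separates accession-like input from free text. It assumes the common accession prefixes (MGYS and MGYG for MGnify, PRJEB/PRJNA/PRJDB and ERP/SRP/DRP for ENA-style identifiers). The function name `parse_query` and the returned keys are hypothetical, and sorting free text into organism, biome, or phenotype is left to the caller.

```python
import re

# Accession prefixes assumed for illustration; digit counts are left open on purpose.
ACCESSION_PATTERNS = {
    "mgnify_study": re.compile(r"^MGYS\d+$"),
    "mgnify_genome": re.compile(r"^MGYG\d+$"),
    "ena_project": re.compile(r"^PRJ(EB|NA|DB)\d+$"),
    "ena_study": re.compile(r"^[ESD]RP\d+$"),
}


def parse_query(query: str) -> dict:
    """Label a query as an accession (with its kind) or as free text."""
    token = query.strip()
    upper = token.upper()
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(upper):
            return {"type": "accession", "accession_kind": kind, "value": upper}
    # Anything else still needs to be classified as organism, biome, or phenotype.
    return {"type": "free_text", "value": token}
```

An accession can go straight to the matching lookup tool, while free text goes through Phase 1 study discovery.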

Key Phase Notes

Phase 1: ENA requires structured queries (e.g., study_title="*IBD*"), not free text. If ENA fails, fall back to MGnify.
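
A minimal sketch of the ENA-then-MGnify fallback. The two search functions are passed in as plain callables because the exact tool signatures are not given here, so `ena_search` and `mgnify_search` are hypothetical stand-ins for whatever wraps ENAPortal_search_studies and MGnify_search_studies. Treating an empty ENA result as a failure is also an assumption.

```python
from typing import Callable, List


def ena_title_query(term: str) -> str:
    """Wrap a search term in the wildcard form study_title="*term*"."""
    cleaned = term.strip().replace('"', "")
    return f'study_title="*{cleaned}*"'


def discover_studies(
    term: str,
    ena_search: Callable[[str], List],
    mgnify_search: Callable[[str], List],
) -> dict:
    """Try ENA with a structured query; fall back to MGnify on error or no hits."""
    try:
        hits = ena_search(ena_title_query(term))
    except Exception:
        # Any ENA error (bad syntax, timeout) triggers the fallback.
        hits = []
    if hits:
        return {"source": "ENA", "studies": hits}
    return {"source": "MGnify", "studies": mgnify_search(term)}
```

Recording the source alongside the hits keeps the report honest about which database the study list came from.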
Phase 2: GTDB uses its own naming (e.g., s__Bacteroides_A fragilis vs NCBI Bacteroides fragilis). Always note discrepancies. Use GTDB_search_taxon(operation="search_taxon", query=name).
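
One way to flag GTDB/NCBI naming discrepancies is to strip the GTDB rank prefix and placeholder suffix before comparing. This is a heuristic sketch with hypothetical helper names. It only handles cosmetic differences such as s__ and _A; a taxon that GTDB has moved to a different genus will still need manual reconciliation.

```python
import re

_RANK_PREFIX = re.compile(r"^[dpcofgs]__")      # e.g. "s__" for species
_PLACEHOLDER_SUFFIX = re.compile(r"_[A-Z]+$")   # e.g. "_A" in "Bacteroides_A"


def strip_gtdb_decorations(gtdb_name: str) -> str:
    """Remove the rank prefix and per-word placeholder suffixes from a GTDB name."""
    name = _RANK_PREFIX.sub("", gtdb_name.strip())
    words = [_PLACEHOLDER_SUFFIX.sub("", word) for word in name.split()]
    return " ".join(words)


def compare_names(gtdb_name: str, ncbi_name: str) -> dict:
    """Report whether the names agree exactly or only after stripping."""
    bare = _RANK_PREFIX.sub("", gtdb_name.strip())
    return {
        "gtdb": gtdb_name,
        "ncbi": ncbi_name,
        "exact_match": bare == ncbi_name,
        "match_after_stripping": strip_gtdb_decorations(gtdb_name) == ncbi_name,
    }
```

A result with `exact_match` false and `match_after_stripping` true is the kind of discrepancy worth noting in the report.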
Phase 3 - Quality tiers (MIMAG):
  • High: >= 90% complete, <= 5% contamination, 5S/16S/23S rRNA genes + >= 18 tRNAs
  • Medium: >= 50% complete, <= 10% contamination
  • Low: below medium -- flag but don't exclude
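
The tiers above translate directly into a small classifier. This is a sketch under the thresholds as listed. The function name and its `has_rrna` and `trna_count` parameters are hypothetical, and when rRNA or tRNA information is missing the genome conservatively drops to medium.

```python
def mimag_tier(
    completeness: float,
    contamination: float,
    has_rrna: bool = False,
    trna_count: int = 0,
) -> str:
    """Assign a quality tier from CheckM-style percentages and RNA gene evidence."""
    if completeness >= 90 and contamination <= 5 and has_rrna and trna_count >= 18:
        return "high"
    if completeness >= 50 and contamination <= 10:
        return "medium"
    # Low-tier genomes are flagged, not excluded.
    return "low"


# Usage: tag each genome record instead of filtering it out.
genomes = [
    {"id": "genome_a", "completeness": 96.0, "contamination": 1.2},
    {"id": "genome_b", "completeness": 42.0, "contamination": 3.0},
]
for genome in genomes:
    genome["tier"] = mimag_tier(genome["completeness"], genome["contamination"])
```

The records in the usage lines are illustrative placeholders, not real MGnify genomes.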
Phase 4 - Functional interpretation: Don't just list GO terms. Connect to biology:
Functional Category | Key KEGG Pathways | Significance
SCFA production | map00650, map00640 | Gut barrier, anti-inflammatory
LPS biosynthesis | map00540 | Pro-inflammatory, endotoxemia
Bile acid metabolism | map00120 | Fat absorption, FXR signaling
Tryptophan metabolism | map00380 | Serotonin, AhR, immune
Vitamin biosynthesis | map00730/740/760 | Host nutritional contribution
Use kegg_search_pathway(keyword=...) (NOT query). Pathway IDs need organism prefix (hsa, ko, eco), NOT bare map.
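
Because the reference IDs in the table are written with the map prefix, a small converter can help before lookups. This is a sketch with a hypothetical function name. It only rewrites the prefix and does not check that the resulting pathway exists for the chosen organism.

```python
import re

# Two to four lowercase letters followed by a five-digit pathway number.
_PATHWAY_ID = re.compile(r"^([a-z]{2,4})(\d{5})$")


def with_prefix(pathway_id: str, prefix: str = "ko") -> str:
    """Swap the prefix of a KEGG pathway ID, e.g. map00650 -> ko00650."""
    match = _PATHWAY_ID.match(pathway_id.strip().lower())
    if match is None:
        raise ValueError(f"Not a recognisable KEGG pathway ID: {pathway_id!r}")
    return prefix + match.group(2)


# Usage: convert the SCFA-related reference IDs to the ko form.
scfa_ids = [with_prefix(pid) for pid in ("map00650", "map00640")]
```

Defaulting to ko is a choice made for this sketch; an organism code such as eco or hsa can be passed when a specific genome is in scope.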
Phase 5: GMrepo uses MeSH terms: "Crohn Disease" not "IBD", "Colitis, Ulcerative" not "UC", "Colorectal Neoplasms" not "colorectal cancer". Try NCBI taxon IDs if species name fails.
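
The informal-to-MeSH substitutions above can be kept in a small lookup table. This sketch contains only the three pairs named in the note, and the table and function names are hypothetical. Note that "IBD" is mapped to "Crohn Disease" as written above, even though the abbreviation also covers ulcerative colitis, so both terms may be worth querying.

```python
# Informal term (lowercase) -> MeSH term, taken from the Phase 5 note.
MESH_TERMS = {
    "ibd": "Crohn Disease",
    "uc": "Colitis, Ulcerative",
    "colorectal cancer": "Colorectal Neoplasms",
}


def to_mesh(term: str) -> str:
    """Return the MeSH form of a known informal term, else the cleaned input."""
    key = " ".join(term.lower().split())
    return MESH_TERMS.get(key, " ".join(term.split()))
```

Unknown terms pass through unchanged, so a query that returns nothing should still be retried with a formal MeSH heading or an NCBI taxon ID.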
Phase 6 - Evidence grading:
  • Strong: Meta-analysis or >5 studies, consistent direction
  • Moderate: 2-5 studies consistent, or 1 large cohort
  • Preliminary: Single study or conflicting
  • Mechanistic only: In vitro/animal, no human epidemiology
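
The grading rules can be written as a small function so that every association in a report is graded the same way. This is a sketch. The parameter names are hypothetical, the rules are checked from strongest to weakest, and the extra "No evidence" label for the empty case is an assumption not present in the list above.

```python
def grade_evidence(
    n_human_studies: int,
    consistent: bool,
    has_meta_analysis: bool = False,
    has_large_cohort: bool = False,
    has_mechanistic: bool = False,
) -> str:
    """Grade a microbe-phenotype association from a summary of the literature."""
    if has_meta_analysis or (n_human_studies > 5 and consistent):
        return "Strong"
    if consistent and (n_human_studies >= 2 or has_large_cohort):
        return "Moderate"
    if n_human_studies >= 1:
        # A single study, or several studies that disagree.
        return "Preliminary"
    if has_mechanistic:
        return "Mechanistic only"
    return "No evidence"
```

For example, three human studies pointing in the same direction grade as "Moderate", while three conflicting studies grade as "Preliminary".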
Phase 7 - Report: Executive summary, study landscape, GTDB taxonomy, functional interpretation (not GO term lists), clinical relevance with evidence grades, mechanistic model, genome catalog with quality tiers, data gaps.

Edge Cases & Fallbacks

  • Taxon not in GTDB: Try partial search or fall back to MGnify (NCBI taxonomy)
  • No GMrepo data: Normal for non-gut organisms; use literature
  • GMrepo 0 results: Use formal MeSH terms or NCBI taxon IDs
  • No KEGG match: Check MetaCyc or literature

Limitations

  • GMrepo: Gut-only
  • GTDB: Bacteria/Archaea only
  • ENA: Raw data only, strict query syntax
  • No sequence analysis: Queries databases, not raw FASTQ/FASTA