tooluniverse-metagenomics-analysis

Metagenomics & Microbiome Analysis

Integrated pipeline for exploring microbiome studies, classifying taxa, assessing genome quality, linking microbial composition to clinical phenotypes, and interpreting findings through pathway analysis and literature context.

Guiding principles:

Study context first -- understand biome, sequencing method, and metadata before diving into taxa
Taxonomic consistency -- GTDB taxonomy as reference standard; reconcile NCBI where needed
Genome quality matters -- CheckM completeness/contamination thresholds determine trustworthy MAGs
Interpretation over enumeration -- explain what taxa mean for the biological question
English-first queries -- use English terms in tool calls

LOOK UP, DON'T GUESS

When uncertain about any scientific fact, SEARCH databases first rather than reasoning from memory.

COMPUTE, DON'T DESCRIBE

When analysis requires computation (statistics, data processing, scoring, enrichment), write and run Python code via Bash. Don't describe what you would do — execute it and report actual results. Use ToolUniverse tools to retrieve data, then Python (pandas, scipy, statsmodels, matplotlib) to analyze it.

Core Databases

Database	Best For
MGnify	Processed metagenomics studies, taxonomic/functional results
GTDB	Standardized bacterial/archaeal taxonomy, species-level resolution
GMrepo	Gut species-to-human-health phenotype associations
ENA	Raw sequencing datasets and study metadata
KEGG	Pathway mapping for microbial functional annotations
PubMed/EuropePMC	Published microbiome-disease studies
CTD	Chemical-microbiome-disease relationships

Workflow

Phase 0: Parse query → organism, biome, phenotype, or accession
Phase 1: Study Discovery → MGnify_search_studies, ENAPortal_search_studies
Phase 2: Taxonomic Classification → GTDB_search_genomes, GTDB_get_species, GTDB_search_taxon
Phase 3: Genome Quality → MGnify_search_genomes, MGnify_get_genome (CheckM metrics)
Phase 4: Functional Annotation → MGnify GO terms + KEGG pathway mapping
Phase 5: Clinical Associations → GMrepo species-phenotype links
Phase 6: Literature → PubMed/EuropePMC + CTD gene-disease
Phase 7: Interpretation & Report Synthesis
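
Phase 0 can be sketched as a small router that decides whether the input is an accession or free text. This is a minimal illustration, not part of the skill: the function name and the accession patterns (MGnify study, INSDC project, assembly accession) are assumptions based on commonly seen public formats.

```python
import re

# Assumed accession shapes; adjust if the databases use other formats.
ACCESSION_PATTERNS = {
    "mgnify_study": re.compile(r"^MGYS\d{8}$"),
    "ena_project": re.compile(r"^PRJ(?:EB|NA|DB)\d+$"),
    "assembly": re.compile(r"^GC[AF]_\d{9}\.\d+$"),
}


def classify_query(text: str) -> str:
    """Return the accession type, or 'free_text' for organisms, biomes, phenotypes."""
    token = text.strip()
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(token):
            return kind
    return "free_text"
```

Free-text input would then continue to study discovery in Phase 1, while a recognized accession could go straight to the matching database.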

Key Phase Notes

Phase 1: ENA requires structured queries (e.g., study_title="*IBD*"), not free text. If ENA fails, fall back to MGnify.
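
A small helper can turn a free-text keyword into the structured form shown in the Phase 1 note. This is only a sketch: the helper name is hypothetical, and how the resulting string is passed to ENAPortal_search_studies is not specified here.

```python
def ena_title_query(keyword: str) -> str:
    """Wrap a keyword as a wildcard study_title filter, e.g. study_title="*IBD*"."""
    # Drop surrounding whitespace, stray wildcards, and quotes before wrapping.
    cleaned = keyword.strip().strip("*").replace('"', "")
    if not cleaned:
        raise ValueError("keyword must not be empty")
    return f'study_title="*{cleaned}*"'
```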

Phase 2: GTDB uses its own naming (e.g., s__Bacteroides_A fragilis vs NCBI Bacteroides fragilis). Always note discrepancies. Use GTDB_search_taxon(operation="search_taxon", query=name).
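
The GTDB-versus-NCBI naming difference can be detected with a simple string heuristic. The sketch below assumes GTDB names carry a rank prefix such as s__ and optional uppercase placeholder suffixes such as _A; both function names are hypothetical, and the output is a hint for the report, not an authoritative taxonomy mapping.

```python
import re

_RANK_PREFIX = re.compile(r"^[dpcofgs]__")      # d__, p__, ... s__
_PLACEHOLDER_SUFFIX = re.compile(r"_[A-Z]+$")   # e.g. the _A in Bacteroides_A


def gtdb_to_plain_name(gtdb_name: str) -> str:
    """Strip the rank prefix and placeholder suffixes from a GTDB name."""
    name = _RANK_PREFIX.sub("", gtdb_name.strip())
    return " ".join(_PLACEHOLDER_SUFFIX.sub("", word) for word in name.split())


def naming_note(gtdb_name: str, ncbi_name: str):
    """Return a discrepancy note for the report, or None if the names agree."""
    raw = _RANK_PREFIX.sub("", gtdb_name.strip())
    if raw == ncbi_name.strip():
        return None
    return f"GTDB '{raw}' vs NCBI '{ncbi_name.strip()}'"
```

The plain name is a reasonable search term when a GTDB label returns nothing in an NCBI-based resource.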

Phase 3 - Quality tiers (MIMAG):

High: >= 90% complete, <= 5% contamination, 5S/16S/23S rRNA genes + >= 18 tRNAs
Medium: >= 50% complete, <= 10% contamination
Low: below medium -- flag but don't exclude
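
The tiers above translate directly into a classifier. This is a minimal sketch using the thresholds as listed; the function and parameter names are illustrative, and completeness and contamination are assumed to be CheckM percentages.

```python
def mimag_tier(completeness: float, contamination: float,
               has_rrna: bool = False, trna_count: int = 0) -> str:
    """Assign a quality tier from CheckM percentages plus rRNA/tRNA evidence."""
    if (completeness >= 90 and contamination <= 5
            and has_rrna and trna_count >= 18):
        return "high"
    if completeness >= 50 and contamination <= 10:
        return "medium"
    # Low-tier genomes are flagged, not dropped.
    return "low"
```

Note that a genome at 95% completeness and 2% contamination still lands in the medium tier when the rRNA or tRNA criterion is unmet or unknown.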

Phase 4 - Functional interpretation: Don't just list GO terms. Connect to biology:

Functional Category	Key KEGG Pathways	Significance
SCFA production	map00650, map00640	Gut barrier, anti-inflammatory
LPS biosynthesis	map00540	Pro-inflammatory, endotoxemia
Bile acid metabolism	map00120	Fat absorption, FXR signaling
Tryptophan metabolism	map00380	Serotonin, AhR, immune
Vitamin biosynthesis	map00730/740/760	Host nutritional contribution

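Use kegg_search_pathway(keyword=...); the parameter is keyword, NOT query. Pathway IDs need an organism or reference prefix (hsa, ko, eco), NOT a bare map prefix. For example, the map00650 entry in the table above becomes ko00650.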
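
The pathway IDs in the table can be rewritten with a usable prefix before lookup. The sketch below is an assumption-laden helper, not part of any KEGG tool: both function names are hypothetical, and ko is used as the default prefix only because it is one of the prefixes named above.

```python
import re

_MAP_ID = re.compile(r"^map(\d{5})$")


def expand_shorthand(entry: str) -> list:
    """Expand table shorthand such as 'map00730/740/760' into full IDs."""
    first, *rest = entry.split("/")
    # Each short suffix replaces the same number of trailing characters.
    return [first] + [first[: len(first) - len(s)] + s for s in rest]


def replace_map_prefix(pathway_id: str, prefix: str = "ko") -> str:
    """Swap a bare 'map' prefix for the given one; leave other IDs unchanged."""
    match = _MAP_ID.match(pathway_id.strip())
    return f"{prefix}{match.group(1)}" if match else pathway_id.strip()
```

For instance, expanding the vitamin biosynthesis entry and then applying the prefix yields three separate prefixed IDs.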

Phase 5: GMrepo uses MeSH terms: "Crohn Disease" not "IBD", "Colitis, Ulcerative" not "UC", "Colorectal Neoplasms" not "colorectal cancer". Try NCBI taxon IDs if species name fails.
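
A lookup table is one way to normalize informal phenotype names before querying GMrepo. The mapping below contains only the three pairs given in the text; the function name is hypothetical, and unknown terms are passed through unchanged.

```python
# Informal term (lowercase) -> MeSH term, as paired in the Phase 5 note.
MESH_TERMS = {
    "ibd": "Crohn Disease",
    "uc": "Colitis, Ulcerative",
    "colorectal cancer": "Colorectal Neoplasms",
}


def to_mesh(term: str) -> str:
    """Return the MeSH form of a known informal term, else the input stripped."""
    cleaned = term.strip()
    return MESH_TERMS.get(cleaned.lower(), cleaned)
```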

Phase 6 - Evidence grading:

Strong: Meta-analysis or >5 studies, consistent direction
Moderate: 2-5 studies consistent, or 1 large cohort
Preliminary: Single study or conflicting
Mechanistic only: In vitro/animal, no human epidemiology
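
The grading rules can be encoded so that every association in the report is graded the same way. This is one possible reading of the rules above: it assumes a meta-analysis alone is enough for "strong", that zero human studies means only in vitro or animal evidence exists, and all names are illustrative.

```python
def grade_evidence(n_human_studies: int, consistent: bool,
                   has_meta_analysis: bool = False,
                   has_large_cohort: bool = False) -> str:
    """Map study counts and consistency onto the four evidence grades."""
    if n_human_studies == 0:
        return "mechanistic only"
    if has_meta_analysis or (n_human_studies > 5 and consistent):
        return "strong"
    if (2 <= n_human_studies <= 5 and consistent) or has_large_cohort:
        return "moderate"
    # Single studies and conflicting results both land here.
    return "preliminary"
```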

Phase 7 - Report: Executive summary, study landscape, GTDB taxonomy, functional interpretation (not GO term lists), clinical relevance with evidence grades, mechanistic model, genome catalog with quality tiers, data gaps.

Edge Cases & Fallbacks

Taxon not in GTDB: Try partial search or fall back to MGnify (NCBI taxonomy)
No GMrepo data: Normal for non-gut organisms; use literature
GMrepo 0 results: Use formal MeSH terms or NCBI taxon IDs
No KEGG match: Check MetaCyc or literature
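
These fallbacks share one pattern: try sources in order and keep the first non-empty result. The sketch below is generic; the helper name is hypothetical, and each lookup is assumed to be a zero-argument callable that wraps whatever tool call applies and returns a list.

```python
def first_result(lookups):
    """Run (label, callable) pairs in order; return the first non-empty result."""
    for label, lookup in lookups:
        results = lookup()
        if results:
            return label, results
    # Nothing matched: the caller should report a data gap.
    return None, []
```

Returning the label alongside the results lets the report state which source answered, for example a partial GTDB search versus the MGnify fallback.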

Limitations

GMrepo: Gut-only
GTDB: Bacteria/Archaea only
ENA: Raw data only, strict query syntax
No sequence analysis: Queries databases, not raw FASTQ/FASTA
