tooluniverse-phylogenetics
Phylogenetics and Sequence Analysis
Comprehensive phylogenetics and sequence analysis using PhyKIT, Biopython, and DendroPy. Designed for bioinformatics questions about multiple sequence alignments, phylogenetic trees, parsimony, molecular evolution, and comparative genomics.
IMPORTANT: This skill handles complex phylogenetic workflows. Most implementation details have been moved to `references/` for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.
When to Use This Skill
Apply when users:
- Have FASTA alignment files and ask about parsimony informative sites, gaps, or alignment quality
- Have Newick tree files and ask about treeness, tree length, evolutionary rate, or DVMC
- Ask about treeness/RCV, RCV, or relative composition variability
- Need to compare phylogenetic metrics between groups (fungi vs animals, etc.)
- Ask about PhyKIT functions (treeness, rcv, dvmc, evo_rate, parsimony_informative, tree_length)
- Have gene family data with paired alignments and trees
- Need Mann-Whitney U tests or other statistical comparisons of phylogenetic metrics
- Ask about bootstrap support, branch lengths, or tree topology
- Need to build trees (NJ, UPGMA, parsimony) from alignments
- Ask about Robinson-Foulds distance or tree comparison
BixBench Coverage: 33 questions across 8 projects (bix-4, bix-11, bix-12, bix-25, bix-35, bix-38, bix-45, bix-60)
NOT for (use other skills instead):
- Multiple sequence alignment generation → Use external tools (MUSCLE, MAFFT, ClustalW)
- Maximum Likelihood tree construction → Use IQ-TREE, RAxML, or PhyML
- Bayesian phylogenetics → Use MrBayes or BEAST
- Ancestral state reconstruction → Use separate tools
Core Principles
- Data-first approach - Discover and validate all input files (alignments, trees) before any analysis
- PhyKIT-compatible - Use PhyKIT functions for treeness, RCV, DVMC, parsimony, evolutionary rate (matches BixBench expected outputs)
- Format-flexible - Support FASTA, PHYLIP, Nexus, Newick, and auto-detect formats
- Batch processing - Process hundreds of gene alignments/trees in a single analysis
- Statistical rigor - Mann-Whitney U, medians, percentiles, standard deviations with scipy.stats
- Precision awareness - Match rounding to 4 decimal places (PhyKIT default) or as requested
- Group comparison - Compare metrics between taxa groups (e.g., fungi vs animals)
- Question-driven - Parse exactly what is asked and return the specific number/statistic
Required Python Packages
```python
# Core (MUST be installed)
import numpy as np
import pandas as pd
from scipy import stats
from Bio import AlignIO, Phylo, SeqIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# PhyKIT (primary computation engine)
from phykit.services.tree.treeness import Treeness
from phykit.services.tree.total_tree_length import TotalTreeLength
from phykit.services.tree.evolutionary_rate import EvolutionaryRate
from phykit.services.tree.dvmc import DVMC
from phykit.services.tree.treeness_over_rcv import TreenessOverRCV
from phykit.services.alignment.parsimony_informative_sites import ParsimonyInformative
from phykit.services.alignment.rcv import RelativeCompositionVariability

# DendroPy (for advanced tree operations)
import dendropy

# ToolUniverse (for sequence retrieval)
from tooluniverse import ToolUniverse
```
**Installation**:
```bash
pip install phykit dendropy biopython pandas numpy scipy
```
High-Level Workflow Decision Tree
```
START: User question about phylogenetic data
│
├─ Q1: What type of analysis is needed?
│  │
│  ├─ ALIGNMENT ANALYSIS (FASTA/PHYLIP files)
│  │  ├─ Parsimony informative sites → phykit_parsimony_informative()
│  │  ├─ RCV score → phykit_rcv()
│  │  ├─ Gap percentage → alignment_gap_percentage()
│  │  ├─ GC content → alignment_statistics()
│  │  └─ See: references/sequence_alignment.md
│  │
│  ├─ TREE ANALYSIS (Newick files)
│  │  ├─ Treeness → phykit_treeness()
│  │  ├─ Tree length → phykit_tree_length()
│  │  ├─ Evolutionary rate → phykit_evolutionary_rate()
│  │  ├─ DVMC → phykit_dvmc()
│  │  ├─ Bootstrap support → extract_bootstrap_support()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ COMBINED ANALYSIS (alignment + tree)
│  │  └─ Treeness/RCV → phykit_treeness_over_rcv()
│  │
│  ├─ TREE CONSTRUCTION (build from alignment)
│  │  ├─ Neighbor-Joining → build_nj_tree()
│  │  ├─ UPGMA → build_upgma_tree()
│  │  ├─ Parsimony → build_parsimony_tree()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ GROUP COMPARISON (fungi vs animals, etc.)
│  │  ├─ Batch compute metrics per group
│  │  ├─ Mann-Whitney U test
│  │  ├─ Summary statistics (median, mean, percentiles)
│  │  └─ See: references/parsimony_analysis.md
│  │
│  └─ TREE COMPARISON
│     ├─ Robinson-Foulds distance → robinson_foulds_distance()
│     └─ Bootstrap consensus → bootstrap_analysis()
│
├─ Q2: What data format is available?
│  ├─ FASTA (.fa, .fasta, .faa, .fna)
│  ├─ PHYLIP (.phy, .phylip) - Use phylip-relaxed for long names
│  ├─ Nexus (.nex, .nexus)
│  ├─ Newick (.nwk, .newick, .tre, .tree)
│  └─ Auto-detect with load_alignment() or load_tree()
│
└─ Q3: Is this a batch analysis?
   ├─ Single gene → Run metric function once
   ├─ Multiple genes → Use batch_compute_metric()
   └─ Group comparison → Use discover_gene_files() + compare_groups()
```
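The extension-to-format mapping listed under Q2 can be sketched as a small lookup. This is a minimal illustration, not the skill's actual `load_alignment()` or `load_tree()` code: the helper name `guess_format` is hypothetical, and the returned strings are format names that Biopython's `AlignIO` and `Phylo` parsers accept.

```python
from pathlib import Path

# Hypothetical lookup table; the extensions are the ones listed under Q2.
EXTENSION_TO_FORMAT = {
    ".fa": "fasta", ".fasta": "fasta", ".faa": "fasta", ".fna": "fasta",
    ".phy": "phylip-relaxed", ".phylip": "phylip-relaxed",
    ".nex": "nexus", ".nexus": "nexus",
    ".nwk": "newick", ".newick": "newick", ".tre": "newick", ".tree": "newick",
}


def guess_format(path):
    """Return a Biopython format name based on the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Unrecognized extension {suffix!r} for {path}") from None


print(guess_format("gene1.fa"))
print(guess_format("gene1.PHY"))
print(guess_format("gene1.nwk"))
```

The result could then be handed to a parser, for example `AlignIO.read(path, guess_format(path))` for alignments or `Phylo.read(path, guess_format(path))` for trees.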
Quick Reference: Common Metrics
| Metric | Function | Input | Description |
|---|---|---|---|
| Treeness | `phykit_treeness()` | Newick | Internal branch length / Total branch length |
| RCV | `phykit_rcv()` | FASTA/PHYLIP | Relative Composition Variability |
| Treeness/RCV | `phykit_treeness_over_rcv()` | Both | Treeness divided by RCV |
| Tree Length | `phykit_tree_length()` | Newick | Sum of all branch lengths |
| Evolutionary Rate | `phykit_evolutionary_rate()` | Newick | Total branch length / number of terminals |
| DVMC | `phykit_dvmc()` | Newick | Degree of Violation of Molecular Clock |
| Parsimony Sites | `phykit_parsimony_informative()` | FASTA/PHYLIP | Sites with ≥2 distinct characters, each appearing ≥2 times |
| Gap Percentage | `alignment_gap_percentage()` | FASTA/PHYLIP | Percentage of gap characters |

See `scripts/tree_statistics.py` for implementation.
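To make the tree-metric definitions in the table concrete, here is a hand-worked sketch on a four-taxon toy tree. It does not call PhyKIT; the branch lengths and the `(length, is_terminal)` representation are invented for illustration.

```python
# Toy tree: ((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);
# Each branch is (length, is_terminal). Values are made up for illustration.
branches = [
    (0.1, True),   # A
    (0.2, True),   # B
    (0.4, True),   # C
    (0.5, True),   # D
    (0.3, False),  # internal branch above (A, B)
    (0.6, False),  # internal branch above (C, D)
]

tree_length = sum(length for length, _ in branches)
internal_length = sum(length for length, is_tip in branches if not is_tip)
num_terminals = sum(1 for _, is_tip in branches if is_tip)

treeness = internal_length / tree_length         # internal / total
evolutionary_rate = tree_length / num_terminals  # total / number of terminals

print(f"Tree length: {tree_length:.4f}")
print(f"Treeness: {treeness:.4f}")
print(f"Evolutionary rate: {evolutionary_rate:.4f}")
```

With these made-up lengths the script prints a tree length of 2.1000, a treeness of 0.4286, and an evolutionary rate of 0.5250.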
Common Analysis Patterns (BixBench)
Pattern 1: Single Metric Across Groups
Question: "What is the median DVMC for fungi vs animals?"
Workflow:
```python
# 1. Discover files
fungi_genes = discover_gene_files("data/fungi")
animal_genes = discover_gene_files("data/animals")

# 2. Compute metric
fungi_dvmc = batch_dvmc(fungi_genes)
animal_dvmc = batch_dvmc(animal_genes)

# 3. Compare
fungi_values = list(fungi_dvmc.values())
animal_values = list(animal_dvmc.values())
print(f"Fungi median DVMC: {np.median(fungi_values):.4f}")
print(f"Animal median DVMC: {np.median(animal_values):.4f}")
```
**See**: `references/parsimony_analysis.md` for full implementation
Pattern 2: Statistical Comparison
Question: "What is the Mann-Whitney U statistic comparing treeness between groups?"
Workflow:
```python
from scipy import stats

# Compute treeness for both groups
group1_treeness = batch_treeness(group1_genes)
group2_treeness = batch_treeness(group2_genes)

# Mann-Whitney U test (two-sided)
u_stat, p_value = stats.mannwhitneyu(
    list(group1_treeness.values()),
    list(group2_treeness.values()),
    alternative='two-sided'
)
print(f"U statistic: {u_stat:.0f}")
print(f"P-value: {p_value:.4e}")
```
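For intuition about what the U statistic measures, the sketch below computes it by direct pair counting on made-up treeness values. It is a pure-Python illustration rather than a replacement for `scipy.stats.mannwhitneyu`; the helper name `mann_whitney_u` is hypothetical, and no p-value is computed.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: count pairs where x_i > y_j (ties count 0.5)."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Made-up treeness values for two groups.
group1 = [0.10, 0.20, 0.30]
group2 = [0.25, 0.35, 0.40]

u1 = mann_whitney_u(group1, group2)
u2 = mann_whitney_u(group2, group1)

print(f"U (group1): {u1:.0f}")
print(f"U (group2): {u2:.0f}")
# The two statistics always sum to len(group1) * len(group2).
assert u1 + u2 == len(group1) * len(group2)
```

Here U is 1 for the first group and 8 for the second. Because the value depends on which group is passed first, argument order matters when a question asks for the statistic itself; recent SciPy releases document the returned statistic as the one corresponding to the first sample.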
Pattern 3: Filtering + Metric
Question: "What is the treeness/RCV for alignments with <5% gaps?"
Workflow:
```python
# 1. Filter by gap percentage
valid_genes = []
for entry in gene_files:
    if 'aln_file' in entry:
        gap_pct = alignment_gap_percentage(entry['aln_file'])
        if gap_pct < 5.0:
            valid_genes.append(entry)

# 2. Compute metric on filtered set
results = batch_treeness_over_rcv(valid_genes)

# 3. Report
values = [r[0] for r in results.values()]  # treeness/rcv ratio
print(f"Median treeness/RCV: {np.median(values):.4f}")
```
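The gap filter in step 1 and the parsimony-informative-site definition from the metrics table can both be illustrated on a toy alignment. This is a standard-library sketch, not the skill's `alignment_gap_percentage()` or PhyKIT's implementation; the function names are hypothetical, and treating only `-` and `?` as missing data is an assumption.

```python
from collections import Counter

MISSING = {"-", "?"}  # assumption: only these characters count as gaps/missing


def gap_percentage(seqs):
    """Percentage of all alignment cells that are gap/missing characters."""
    total = sum(len(s) for s in seqs)
    gaps = sum(1 for s in seqs for ch in s if ch in MISSING)
    return 100.0 * gaps / total


def count_parsimony_informative(seqs):
    """Count columns with >= 2 character states that each occur >= 2 times."""
    informative = 0
    for column in zip(*seqs):
        counts = Counter(ch.upper() for ch in column if ch not in MISSING)
        repeated_states = [state for state, n in counts.items() if n >= 2]
        if len(repeated_states) >= 2:
            informative += 1
    return informative


# Made-up aligned sequences (4 taxa x 6 columns).
alignment = [
    "ACGTA-",
    "ACGTAC",
    "ATGCAC",
    "ATG-AC",
]

print(f"Gap percentage: {gap_percentage(alignment):.4f}")
print(f"Parsimony informative sites: {count_parsimony_informative(alignment)}")
```

In this toy alignment 2 of 24 cells are gaps (about 8.33%), so it would fail a `< 5.0` filter, and only the second column (C, C, T, T) is parsimony informative.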
Pattern 4: Specific Gene Lookup
Question: "What is the evolutionary rate for gene X?"
Workflow:
```python
# Find gene file
gene_files = discover_gene_files("data/")
gene_entry = [g for g in gene_files if g['gene_id'] == 'X'][0]

# Compute metric
evo_rate = phykit_evolutionary_rate(gene_entry['tree_file'])
print(f"Evolutionary rate for gene X: {evo_rate:.4f}")
```
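The list-index lookup above raises a bare `IndexError` when the gene is absent. A slightly more defensive variant, sketched here with a hypothetical helper `find_gene` and made-up entries in the `gene_id`/`tree_file` shape used by this skill, reports which ID was missing:

```python
def find_gene(gene_files, gene_id):
    """Return the first entry whose 'gene_id' matches, or raise KeyError."""
    for entry in gene_files:
        if entry.get("gene_id") == gene_id:
            return entry
    raise KeyError(f"Gene {gene_id!r} not found among {len(gene_files)} entries")


# Made-up entries shaped like the output of discover_gene_files().
gene_files = [
    {"gene_id": "gene1", "aln_file": "gene1.fa", "tree_file": "gene1.nwk"},
    {"gene_id": "X", "aln_file": "X.fa", "tree_file": "X.nwk"},
]

entry = find_gene(gene_files, "X")
print(entry["tree_file"])
```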
---
Choosing Methods: When to Use What
Alignment Methods
When building alignments (use external tools, not this skill):
| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| ClustalW | Slow | Medium | Small datasets (<100 sequences), educational |
| MUSCLE | Fast | High | Medium datasets (100-1000 sequences) |
| MAFFT | Very Fast | Very High | Recommended - Large datasets (>1000 sequences) |
For this skill: Work with pre-aligned sequences. Use `load_alignment()` to read any format.
Tree Building Methods
When to use which tree method:
| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| Neighbor-Joining | Fast | Medium | Quick trees, large datasets, exploratory |
| UPGMA | Fast | Low | Assumes molecular clock, special cases only |
| Maximum Parsimony | Medium | Medium | Small datasets, discrete characters |
| Maximum Likelihood | Slow | High | Use external tools (IQ-TREE, RAxML) for production |
Implementation in this skill:
```python
# Fast distance-based trees
tree = build_nj_tree("alignment.fa")     # Neighbor-Joining
tree = build_upgma_tree("alignment.fa")  # UPGMA

# Parsimony (for small alignments)
tree = build_parsimony_tree("alignment.fa")
```
**For production ML trees**: Use IQ-TREE or RAxML externally, then analyze with this skill.
See `references/tree_building.md` for detailed implementations.

---
Batch Processing
Discovering Gene Files
```python
# Auto-discover paired alignment + tree files
gene_files = discover_gene_files("data/")

# Result: list of dicts with 'gene_id', 'aln_file', 'tree_file'
# [
#     {'gene_id': 'gene1', 'aln_file': 'gene1.fa', 'tree_file': 'gene1.nwk'},
#     {'gene_id': 'gene2', 'aln_file': 'gene2.fa', 'tree_file': 'gene2.nwk'},
#     ...
# ]
```
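One way such discovery could work is to pair files by their shared stem. The sketch below is an assumption about the behavior, not the skill's actual `discover_gene_files()`; the function name `discover_gene_files_sketch` and the extension sets are illustrative.

```python
from pathlib import Path

ALIGNMENT_EXTS = {".fa", ".fasta", ".faa", ".fna", ".phy", ".phylip"}
TREE_EXTS = {".nwk", ".newick", ".tre", ".tree"}


def discover_gene_files_sketch(directory):
    """Pair alignment and tree files in `directory` that share a file stem."""
    genes = {}
    for path in sorted(Path(directory).iterdir()):
        suffix = path.suffix.lower()
        if suffix in ALIGNMENT_EXTS:
            key = "aln_file"
        elif suffix in TREE_EXTS:
            key = "tree_file"
        else:
            continue  # ignore unrelated files
        genes.setdefault(path.stem, {"gene_id": path.stem})[key] = str(path)
    # Entries may carry only one of the two keys if a partner file is missing.
    return [genes[gene_id] for gene_id in sorted(genes)]
```

An entry whose partner file is missing keeps only one of the two path keys, which is consistent with the `'aln_file' in entry` check used in Pattern 3.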
Computing Metrics in Batch
```python
# Tree metrics
treeness_results = batch_treeness(gene_files)
tree_length_results = batch_tree_length(gene_files)
dvmc_results = batch_dvmc(gene_files)
evo_rate_results = batch_evolutionary_rate(gene_files)

# Alignment metrics
rcv_results = batch_rcv(gene_files)
pi_results = batch_parsimony_informative(gene_files)
gap_results = batch_gap_percentage(gene_files)

# Combined metrics
treeness_rcv_results = batch_treeness_over_rcv(gene_files)

# All return dict: {gene_id: value}
```
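The batch helpers share one shape: apply a per-file metric function to every entry and key the result by `gene_id`. A generic version might look like the following; `batch_compute` and the toy metric are hypothetical stand-ins, not the skill's `batch_compute_metric()`.

```python
def batch_compute(gene_files, file_key, metric_fn):
    """Apply metric_fn to entry[file_key] for each entry; return (results, errors).

    Entries without file_key are skipped; failures are collected, not raised.
    """
    results, errors = {}, {}
    for entry in gene_files:
        path = entry.get(file_key)
        if path is None:
            continue
        try:
            results[entry["gene_id"]] = metric_fn(path)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            errors[entry["gene_id"]] = str(exc)
    return results, errors


# Toy metric standing in for a real tree metric: the length of the path string.
gene_files = [
    {"gene_id": "gene1", "tree_file": "gene1.nwk"},
    {"gene_id": "gene2", "aln_file": "gene2.fa"},  # no tree file, so it is skipped
]
results, errors = batch_compute(gene_files, "tree_file", len)
print(results)
print(errors)
```

Comparing `len(results)` plus `len(errors)` against the number of discovered genes is a cheap way to confirm that every gene in a group was actually processed.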
Statistical Analysis
```python
# Summary statistics (named `summary` to avoid shadowing the scipy `stats` module)
summary = summary_stats(list(treeness_results.values()))
# Returns: {'mean': ..., 'median': ..., 'std': ..., 'min': ..., 'max': ...}

# Group comparison
comparison = compare_groups(
    list(fungi_treeness.values()),
    list(animal_treeness.values()),
    group1_name="Fungi",
    group2_name="Animals"
)
# Returns: {'u_statistic': ..., 'p_value': ..., 'Fungi': {...}, 'Animals': {...}}
```
See `references/parsimony_analysis.md` for full workflow.
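A `summary_stats`-style helper can be approximated with the standard library. This sketch is an assumption about its behavior: the name `summary_stats_sketch` is hypothetical, and it uses the population standard deviation (matching NumPy's default `ddof=0`), which may differ from the skill's choice.

```python
import statistics


def summary_stats_sketch(values):
    """Return mean, median, std (population), min, and max for a list of numbers."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up treeness values.
summary = summary_stats_sketch([0.2, 0.4, 0.6, 0.8])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

If a question asks for a sample standard deviation instead, `statistics.stdev` (or `np.std(values, ddof=1)`) would be the corresponding call.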

---
Answer Extraction for BixBench
| Question Pattern | Extraction Method |
|---|---|
| "What is the median X?" | |
| "What is the maximum X?" | |
| "What is the difference between median X for A vs B?" | |
| "What percentage of X have Y above Z?" | |
| "What is the Mann-Whitney U statistic?" | |
| "What is the p-value?" | |
| "What is the X value for gene Y?" | |
| "What is the fold-change in median X?" | |
| "multiplied by 1000" | |
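A minimal sketch of how these question patterns might translate into computations, using only the standard library and made-up values. The mapping is an illustrative assumption rather than the skill's documented extraction methods, and the direction of differences and fold-changes (A relative to B) should be checked against the question wording.

```python
import statistics

# Made-up metric values for two groups.
group_a = [0.10, 0.20, 0.30, 0.40]
group_b = [0.20, 0.40, 0.60, 0.80]
threshold = 0.25

median_a = statistics.median(group_a)
median_b = statistics.median(group_b)

answers = {
    "median": median_a,
    "maximum": max(group_a),
    "median difference (A - B)": median_a - median_b,
    "percent above threshold": 100.0 * sum(v > threshold for v in group_a) / len(group_a),
    "fold-change in median (A / B)": median_a / median_b,
    "median multiplied by 1000": median_a * 1000,
}

for label, value in answers.items():
    print(f"{label}: {value:.4f}")
```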
Rounding Rules
- PhyKIT default: 4 decimal places
- Percentages: Match question format (e.g., "35%" → integer, "3.5%" → 1 decimal)
- P-values: Scientific notation for very small values
- U statistics: Integer (no decimals)
- Always check question wording: "rounded to 3 decimal places" overrides defaults
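These rules map onto Python format specifiers. The values below are made up; the snippet only illustrates the formatting, not any real result.

```python
# Made-up values, used only to illustrate formatting.
treeness = 0.123456
p_value = 1.234e-07
u_stat = 1234.0
percent = 35.2

formatted = {
    "default (4 decimals)": f"{treeness:.4f}",
    "requested 3 decimals": f"{treeness:.3f}",
    "p-value (scientific)": f"{p_value:.4e}",
    "U statistic (integer)": f"{u_stat:.0f}",
    "percentage (integer)": f"{percent:.0f}%",
    "percentage (1 decimal)": f"{percent:.1f}%",
}

for label, text in formatted.items():
    print(f"{label}: {text}")
```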
BixBench Question Coverage
| Project | Questions | Metrics |
|---|---|---|
| bix-4 | 7 | DVMC analysis (fungi vs animals) |
| bix-11 | 6 | Treeness analysis (median, percentages, Mann-Whitney U) |
| bix-12 | 5 | Parsimony informative sites (counts, percentages, ratios) |
| bix-25 | 2 | Treeness/RCV with gap filtering |
| bix-35 | 4 | Evolutionary rate (specific genes, comparisons) |
| bix-38 | 5 | Tree length (fold-change, variance, paired ratios) |
| bix-45 | 4 | RCV (Mann-Whitney U, medians, paired differences) |
| bix-60 | 1 | Average treeness across multiple trees |
ToolUniverse Integration
Sequence Retrieval
```python
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()

# Get sequences from NCBI
result = tu.tools.NCBI_get_sequence(accession="NP_000546")

# Get gene tree from Ensembl
tree_result = tu.tools.EnsemblCompara_get_gene_tree(gene="ENSG00000141510")

# Get species tree from OpenTree
tree_result = tu.tools.OpenTree_get_induced_subtree(ott_ids="770315,770319")
```
---
File Structure
```
tooluniverse-phylogenetics/
├── SKILL.md                    # This file (workflow orchestration)
├── QUICK_START.md              # Quick reference
├── test_phylogenetics.py       # Comprehensive test suite
├── references/
│   ├── sequence_alignment.md   # Alignment analysis details
│   ├── tree_building.md        # Tree construction methods
│   ├── parsimony_analysis.md   # Statistical comparison workflows
│   └── troubleshooting.md      # Common issues and solutions
└── scripts/
    ├── format_alignment.py     # Alignment format conversion
    └── tree_statistics.py      # Core metric implementations
```
Completeness Checklist
Before returning your answer, verify:
- Identified all input files (alignments and/or trees)
- Detected group structure (fungi/animals/etc.) if applicable
- Used correct PhyKIT function for the requested metric
- Processed ALL genes in each group (not just a sample)
- Applied correct statistical test if comparison requested
- Used correct rounding (4 decimals default, or as specified)
- Returned the specific statistic asked for (median, max, U stat, p-value, etc.)
- For percentage questions, confirmed whether answer is integer or decimal
- For "difference" questions, confirmed direction (A - B vs abs difference)
- For Mann-Whitney U, used `alternative='two-sided'` (default in scipy)
Next Steps
- For detailed alignment analysis workflows → See `references/sequence_alignment.md`
- For tree construction methods → See `references/tree_building.md`
- For statistical comparison examples → See `references/parsimony_analysis.md`
- For common errors and solutions → See `references/troubleshooting.md`
- For script implementations → See `scripts/tree_statistics.py`
Support
For issues with:
- PhyKIT functions: Check PhyKIT documentation at https://jlsteenwyk.com/PhyKIT/
- Biopython tree/alignment parsing: See https://biopython.org/wiki/Phylo
- DendroPy operations: See https://dendropy.org/
- ToolUniverse integration: Check ToolUniverse documentation
License
Same as ToolUniverse framework license.