tooluniverse-phylogenetics

Phylogenetics and Sequence Analysis

Comprehensive phylogenetics and sequence analysis using PhyKIT, Biopython, and DendroPy. Designed for bioinformatics questions about multiple sequence alignments, phylogenetic trees, parsimony, molecular evolution, and comparative genomics.
IMPORTANT: This skill handles complex phylogenetic workflows. Most implementation details have been moved to `references/` for progressive disclosure. This document focuses on high-level decision-making and workflow orchestration.

When to Use This Skill

Apply when users:
  • Have FASTA alignment files and ask about parsimony informative sites, gaps, or alignment quality
  • Have Newick tree files and ask about treeness, tree length, evolutionary rate, or DVMC
  • Ask about treeness/RCV or relative composition variability (RCV)
  • Need to compare phylogenetic metrics between groups (fungi vs animals, etc.)
  • Ask about PhyKIT functions (treeness, rcv, dvmc, evo_rate, parsimony_informative, tree_length)
  • Have gene family data with paired alignments and trees
  • Need Mann-Whitney U tests or other statistical comparisons of phylogenetic metrics
  • Ask about bootstrap support, branch lengths, or tree topology
  • Need to build trees (NJ, UPGMA, parsimony) from alignments
  • Ask about Robinson-Foulds distance or tree comparison
BixBench Coverage: 33 questions across 8 projects (bix-4, bix-11, bix-12, bix-25, bix-35, bix-38, bix-45, bix-60)
NOT for (use other skills instead):
  • Multiple sequence alignment generation → Use external tools (MUSCLE, MAFFT, ClustalW)
  • Maximum Likelihood tree construction → Use IQ-TREE, RAxML, or PhyML
  • Bayesian phylogenetics → Use MrBayes or BEAST
  • Ancestral state reconstruction → Use separate tools

Core Principles

  1. Data-first approach - Discover and validate all input files (alignments, trees) before any analysis
  2. PhyKIT-compatible - Use PhyKIT functions for treeness, RCV, DVMC, parsimony, evolutionary rate (matches BixBench expected outputs)
  3. Format-flexible - Support FASTA, PHYLIP, Nexus, Newick, and auto-detect formats
  4. Batch processing - Process hundreds of gene alignments/trees in a single analysis
  5. Statistical rigor - Mann-Whitney U, medians, percentiles, standard deviations with scipy.stats
  6. Precision awareness - Match rounding to 4 decimal places (PhyKIT default) or as requested
  7. Group comparison - Compare metrics between taxa groups (e.g., fungi vs animals)
  8. Question-driven - Parse exactly what is asked and return the specific number/statistic

Required Python Packages

```python
# Core (MUST be installed)
import numpy as np
import pandas as pd
from scipy import stats
from Bio import AlignIO, Phylo, SeqIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# PhyKIT (primary computation engine)
from phykit.services.tree.treeness import Treeness
from phykit.services.tree.total_tree_length import TotalTreeLength
from phykit.services.tree.evolutionary_rate import EvolutionaryRate
from phykit.services.tree.dvmc import DVMC
from phykit.services.tree.treeness_over_rcv import TreenessOverRCV
from phykit.services.alignment.parsimony_informative_sites import ParsimonyInformative
from phykit.services.alignment.rcv import RelativeCompositionVariability

# DendroPy (for advanced tree operations)
import dendropy

# ToolUniverse (for sequence retrieval)
from tooluniverse import ToolUniverse
```

**Installation**:
```bash
pip install phykit dendropy biopython pandas numpy scipy
```

High-Level Workflow Decision Tree

START: User question about phylogenetic data
├─ Q1: What type of analysis is needed?
│  │
│  ├─ ALIGNMENT ANALYSIS (FASTA/PHYLIP files)
│  │  ├─ Parsimony informative sites → phykit_parsimony_informative()
│  │  ├─ RCV score → phykit_rcv()
│  │  ├─ Gap percentage → alignment_gap_percentage()
│  │  ├─ GC content → alignment_statistics()
│  │  └─ See: references/sequence_alignment.md
│  │
│  ├─ TREE ANALYSIS (Newick files)
│  │  ├─ Treeness → phykit_treeness()
│  │  ├─ Tree length → phykit_tree_length()
│  │  ├─ Evolutionary rate → phykit_evolutionary_rate()
│  │  ├─ DVMC → phykit_dvmc()
│  │  ├─ Bootstrap support → extract_bootstrap_support()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ COMBINED ANALYSIS (alignment + tree)
│  │  └─ Treeness/RCV → phykit_treeness_over_rcv()
│  │
│  ├─ TREE CONSTRUCTION (build from alignment)
│  │  ├─ Neighbor-Joining → build_nj_tree()
│  │  ├─ UPGMA → build_upgma_tree()
│  │  ├─ Parsimony → build_parsimony_tree()
│  │  └─ See: references/tree_building.md
│  │
│  ├─ GROUP COMPARISON (fungi vs animals, etc.)
│  │  ├─ Batch compute metrics per group
│  │  ├─ Mann-Whitney U test
│  │  ├─ Summary statistics (median, mean, percentiles)
│  │  └─ See: references/parsimony_analysis.md
│  │
│  └─ TREE COMPARISON
│     ├─ Robinson-Foulds distance → robinson_foulds_distance()
│     └─ Bootstrap consensus → bootstrap_analysis()
├─ Q2: What data format is available?
│  ├─ FASTA (.fa, .fasta, .faa, .fna)
│  ├─ PHYLIP (.phy, .phylip) - Use phylip-relaxed for long names
│  ├─ Nexus (.nex, .nexus)
│  ├─ Newick (.nwk, .newick, .tre, .tree)
│  └─ Auto-detect with load_alignment() or load_tree()
└─ Q3: Is this a batch analysis?
   ├─ Single gene → Run metric function once
   ├─ Multiple genes → Use batch_compute_metric()
   └─ Group comparison → Use discover_gene_files() + compare_groups()
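
The Q2 branch above can be sketched as a small extension lookup. This is an illustrative stand-in, not the skill's actual `load_alignment()` or `load_tree()`; the helper name `guess_format` and the two lookup tables are hypothetical. The format strings follow Biopython's naming (`fasta`, `phylip-relaxed`, `nexus`, `newick`).

```python
from pathlib import Path

# Extension -> format name, mirroring the formats listed in the decision tree.
# Assumption: PHYLIP files are read as "phylip-relaxed" so long names survive.
ALIGNMENT_FORMATS = {
    ".fa": "fasta", ".fasta": "fasta", ".faa": "fasta", ".fna": "fasta",
    ".phy": "phylip-relaxed", ".phylip": "phylip-relaxed",
    ".nex": "nexus", ".nexus": "nexus",
}
TREE_FORMATS = {
    ".nwk": "newick", ".newick": "newick", ".tre": "newick", ".tree": "newick",
}


def guess_format(path):
    """Return (kind, format) for a file path, e.g. ("alignment", "fasta")."""
    ext = Path(path).suffix.lower()
    if ext in ALIGNMENT_FORMATS:
        return "alignment", ALIGNMENT_FORMATS[ext]
    if ext in TREE_FORMATS:
        return "tree", TREE_FORMATS[ext]
    raise ValueError(f"Unrecognized extension: {ext!r}")


print(guess_format("data/fungi/gene1.fa"))
print(guess_format("data/fungi/gene1.nwk"))
```

Nexus files can hold trees as well as alignments, so a production loader would likely need to inspect the file contents rather than trust the extension alone.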

Quick Reference: Common Metrics

| Metric | Function | Input | Description |
|--------|----------|-------|-------------|
| Treeness | `phykit_treeness(tree_file)` | Newick | Internal branch length / Total branch length |
| RCV | `phykit_rcv(aln_file)` | FASTA/PHYLIP | Relative Composition Variability |
| Treeness/RCV | `phykit_treeness_over_rcv(tree, aln)` | Both | Treeness divided by RCV |
| Tree Length | `phykit_tree_length(tree_file)` | Newick | Sum of all branch lengths |
| Evolutionary Rate | `phykit_evolutionary_rate(tree_file)` | Newick | Total branch length / num terminals |
| DVMC | `phykit_dvmc(tree_file)` | Newick | Degree of Violation of Molecular Clock |
| Parsimony Sites | `phykit_parsimony_informative(aln_file)` | FASTA/PHYLIP | Sites with ≥2 chars appearing ≥2 times |
| Gap Percentage | `alignment_gap_percentage(aln_file)` | FASTA/PHYLIP | Percentage of gap characters |

See `scripts/tree_statistics.py` for implementation.
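
To make the "Parsimony Sites" definition concrete: a column is parsimony informative when it contains at least two different character states and each of those states occurs in at least two sequences. The sketch below counts such columns from plain strings. It is a from-scratch illustration, not PhyKIT's code; the function name and the `ignore` parameter are hypothetical, and PhyKIT's handling of gaps and ambiguity codes may differ.

```python
from collections import Counter


def count_parsimony_informative(seqs, ignore="-?"):
    """Count columns with >=2 character states that each occur >=2 times."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    informative = 0
    for column in zip(*seqs):
        # Assumption: gap and missing-data symbols are not character states.
        counts = Counter(c.upper() for c in column if c not in ignore)
        # A state only counts if at least two sequences share it.
        shared_states = sum(1 for n in counts.values() if n >= 2)
        if shared_states >= 2:
            informative += 1
    return informative


alignment = [
    "ACGTA",
    "ACGTC",
    "ATGAC",
    "ATG-A",
]
# Columns 2 (C,C,T,T) and 5 (A,C,C,A) qualify; column 4 has T twice but A only once.
print(count_parsimony_informative(alignment))  # prints 2
```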

Common Analysis Patterns (BixBench)

Pattern 1: Single Metric Across Groups

Question: "What is the median DVMC for fungi vs animals?"
Workflow:
```python
# 1. Discover files
fungi_genes = discover_gene_files("data/fungi")
animal_genes = discover_gene_files("data/animals")

# 2. Compute metric
fungi_dvmc = batch_dvmc(fungi_genes)
animal_dvmc = batch_dvmc(animal_genes)

# 3. Compare
fungi_values = list(fungi_dvmc.values())
animal_values = list(animal_dvmc.values())
print(f"Fungi median DVMC: {np.median(fungi_values):.4f}")
print(f"Animal median DVMC: {np.median(animal_values):.4f}")
```

**See**: `references/parsimony_analysis.md` for full implementation

Pattern 2: Statistical Comparison

Question: "What is the Mann-Whitney U statistic comparing treeness between groups?"
Workflow:
```python
from scipy import stats

# Compute treeness for both groups
group1_treeness = batch_treeness(group1_genes)
group2_treeness = batch_treeness(group2_genes)

# Mann-Whitney U test (two-sided)
u_stat, p_value = stats.mannwhitneyu(
    list(group1_treeness.values()),
    list(group2_treeness.values()),
    alternative='two-sided'
)
print(f"U statistic: {u_stat:.0f}")
print(f"P-value: {p_value:.4e}")
```
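
For intuition about what the U statistic measures, here is a dependency-free sketch that computes U for the first sample from pooled ranks, using average ranks for ties. It is not a replacement for `stats.mannwhitneyu` (it produces no p-value); the function name and the toy values are made up for illustration.

```python
def mann_whitney_u(x, y):
    """Return the U statistic of sample x, using average ranks for ties."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1 = len(x)
    return rank_sum_x - n1 * (n1 + 1) / 2


# Toy treeness values, not real data.
group1 = [0.61, 0.72, 0.80]
group2 = [0.45, 0.50, 0.66]
print(mann_whitney_u(group1, group2))  # prints 8.0
```

The two one-sample statistics always sum to `len(x) * len(y)`, which is a quick sanity check when a reported U looks like it belongs to the other group.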

Pattern 3: Filtering + Metric

Question: "What is the treeness/RCV for alignments with <5% gaps?"
Workflow:
```python
# 1. Filter by gap percentage
valid_genes = []
for entry in gene_files:
    if 'aln_file' in entry:
        gap_pct = alignment_gap_percentage(entry['aln_file'])
        if gap_pct < 5.0:
            valid_genes.append(entry)

# 2. Compute metric on filtered set
results = batch_treeness_over_rcv(valid_genes)

# 3. Report
values = [r[0] for r in results.values()]  # treeness/rcv ratio
print(f"Median treeness/RCV: {np.median(values):.4f}")
```
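
The two alignment quantities in this pattern can be sketched on plain strings. Both functions are hypothetical illustrations rather than the skill's helpers. The RCV sketch assumes the commonly cited definition (sum over taxa and characters of the absolute deviation of each taxon's character count from the across-taxon mean, divided by taxa × sites) and treats every symbol, gaps included, as a character; PhyKIT's treatment of gaps and ambiguous characters may differ.

```python
from collections import Counter


def gap_percentage(seqs, gap_chars="-"):
    """Percentage of all alignment cells that are gap characters."""
    total = sum(len(s) for s in seqs)
    gaps = sum(s.count(g) for s in seqs for g in gap_chars)
    return 100.0 * gaps / total


def rcv(seqs):
    """Relative composition variability under the assumed definition above."""
    n_taxa, aln_len = len(seqs), len(seqs[0])
    counts = [Counter(s.upper()) for s in seqs]
    chars = set().union(*counts)
    total_dev = 0.0
    for ch in chars:
        mean = sum(c[ch] for c in counts) / n_taxa
        total_dev += sum(abs(c[ch] - mean) for c in counts)
    return total_dev / (n_taxa * aln_len)


print(gap_percentage(["AC-T", "A--T"]))  # 3 gaps in 8 cells -> 37.5
print(rcv(["AAAA", "AATT"]))             # deviations sum to 4, 4 / (2 * 4) -> 0.5
```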

Pattern 4: Specific Gene Lookup

Question: "What is the evolutionary rate for gene X?"
Workflow:
```python
# Find gene file
gene_files = discover_gene_files("data/")
gene_entry = [g for g in gene_files if g['gene_id'] == 'X'][0]

# Compute metric
evo_rate = phykit_evolutionary_rate(gene_entry['tree_file'])
print(f"Evolutionary rate for gene X: {evo_rate:.4f}")
```
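
As a rough check on what the tree metrics mean, the sketch below derives tree length, treeness, and evolutionary rate directly from a Newick string, using the definitions in the Quick Reference (internal / total branch length, sum of branch lengths, total length / number of terminals). The tiny tokenizer is an assumption-laden stand-in for a real parser: it expects unquoted labels, no bracketed comments, and a label or branch length on every tip. Function names are hypothetical.

```python
import re

STRUCTURAL = {"(", ")", ",", ";"}


def newick_branch_lengths(newick):
    """Return (internal_lengths, terminal_lengths) from a simple Newick string."""
    internal, terminal = [], []
    prev = None
    for token in re.findall(r"[(),;]|[^(),;]+", newick):
        token = token.strip()
        if not token:
            continue
        if token not in STRUCTURAL:
            # Text right after ')' annotates an internal node; otherwise it is a tip.
            _, _, length = token.partition(":")
            value = float(length) if length else 0.0
            (internal if prev == ")" else terminal).append(value)
        prev = token
    return internal, terminal


def tree_metrics(newick):
    """Compute tree length, treeness, and evolutionary rate."""
    internal, terminal = newick_branch_lengths(newick)
    total = sum(internal) + sum(terminal)
    return {
        "tree_length": total,
        "treeness": sum(internal) / total,
        "evolutionary_rate": total / len(terminal),
    }


metrics = tree_metrics("((A:0.1,B:0.2):0.3,(C:0.1,D:0.1):0.2);")
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Support values written as internal labels (for example `)95:0.3`) are skipped by the `partition(":")` step, so they do not leak into the branch-length sums.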

---

Choosing Methods: When to Use What

Alignment Methods

When building alignments (use external tools, not this skill):

| Method | Speed | Accuracy | Use Case |
|--------|-------|----------|----------|
| ClustalW | Slow | Medium | Small datasets (<100 sequences), educational |
| MUSCLE | Fast | High | Medium datasets (100-1000 sequences) |
| MAFFT | Very Fast | Very High | Recommended - Large datasets (>1000 sequences) |

For this skill: Work with pre-aligned sequences. Use `load_alignment()` to read any format.
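
Because the skill expects pre-aligned input, it helps to confirm that every sequence has the same length before computing any metric. The sketch below parses FASTA text from any iterable of lines and performs that check. It is a minimal illustration, not the skill's `load_alignment()`; the function name is hypothetical, and in practice `Bio.AlignIO.read` covers the same ground for more formats.

```python
def parse_fasta_alignment(lines):
    """Parse FASTA lines into {name: sequence} and verify equal lengths."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    seqs = {n: "".join(parts) for n, parts in records.items()}
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) > 1:
        raise ValueError(f"not an alignment, sequence lengths differ: {sorted(lengths)}")
    return seqs


example = [">taxon1", "ACG", "T-A", ">taxon2", "ACGTTA"]
print(parse_fasta_alignment(example))
```

To read from disk, pass an open file handle: `with open(path) as fh: parse_fasta_alignment(fh)`.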

Tree Building Methods

When to use which tree method:

| Method | Speed | Accuracy | Use Case |
|--------|-------|----------|----------|
| Neighbor-Joining | Fast | Medium | Quick trees, large datasets, exploratory |
| UPGMA | Fast | Low | Assumes molecular clock, special cases only |
| Maximum Parsimony | Medium | Medium | Small datasets, discrete characters |
| Maximum Likelihood | Slow | High | Use external tools (IQ-TREE, RAxML) for production |

Implementation in this skill:
```python
# Fast distance-based trees
tree = build_nj_tree("alignment.fa")     # Neighbor-Joining
tree = build_upgma_tree("alignment.fa")  # UPGMA

# Parsimony (for small alignments)
tree = build_parsimony_tree("alignment.fa")
```

**For production ML trees**: Use IQ-TREE or RAxML externally, then analyze with this skill.

See `references/tree_building.md` for detailed implementations.
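
Neighbor-Joining and UPGMA both start from a pairwise distance matrix. As a rough illustration of that input, the sketch below computes uncorrected p-distances (the proportion of differing sites) between aligned sequences. It is not Biopython's `DistanceCalculator` and not necessarily what `build_nj_tree()` uses; the function names and the choice to skip gapped columns are assumptions.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Proportion of differing sites, skipping columns where either has a gap."""
    compared = [(x, y) for x, y in zip(a.upper(), b.upper()) if gap not in (x, y)]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(seqs):
    """Return {(name_i, name_j): distance} for every unordered pair of taxa."""
    return {
        (i, j): p_distance(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)
    }


toy = {"a": "ACGT", "b": "ACGA", "c": "TCGA"}
for pair, dist in distance_matrix(toy).items():
    print(pair, f"{dist:.4f}")
```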

---

Batch Processing

Discovering Gene Files

```python
# Auto-discover paired alignment + tree files
gene_files = discover_gene_files("data/")

# Result: list of dicts with 'gene_id', 'aln_file', 'tree_file'
# [
#   {'gene_id': 'gene1', 'aln_file': 'gene1.fa', 'tree_file': 'gene1.nwk'},
#   {'gene_id': 'gene2', 'aln_file': 'gene2.fa', 'tree_file': 'gene2.nwk'},
#   ...
# ]
```
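
One plausible way to produce entries of that shape is to group files by stem and classify them by extension. The sketch below assumes an alignment and its tree share a file stem (for example `gene1.fa` and `gene1.nwk`); real datasets often use other naming schemes, and the actual `discover_gene_files()` may pair files differently. The function name and extension sets are hypothetical.

```python
from pathlib import Path

ALN_EXTS = {".fa", ".fasta", ".faa", ".fna", ".phy", ".phylip"}
TREE_EXTS = {".nwk", ".newick", ".tre", ".tree"}


def pair_gene_files(paths):
    """Group paths by file stem into dicts with gene_id, aln_file, tree_file."""
    genes = {}
    for p in map(Path, paths):
        ext = p.suffix.lower()
        if ext in ALN_EXTS:
            key = "aln_file"
        elif ext in TREE_EXTS:
            key = "tree_file"
        else:
            continue  # ignore files that are neither alignments nor trees
        genes.setdefault(p.stem, {"gene_id": p.stem})[key] = str(p)
    return [genes[g] for g in sorted(genes)]


listing = ["gene2.nwk", "gene1.fa", "gene1.nwk", "README.txt", "gene3.fa"]
for entry in pair_gene_files(listing):
    print(entry)
```

Genes with only one of the two files (such as `gene3` here) are kept with the missing key absent, which is why downstream code checks `'aln_file' in entry` before using it. To scan a directory, pass `Path("data").rglob("*")`.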

Computing Metrics in Batch

```python
# Tree metrics
treeness_results = batch_treeness(gene_files)
tree_length_results = batch_tree_length(gene_files)
dvmc_results = batch_dvmc(gene_files)
evo_rate_results = batch_evolutionary_rate(gene_files)

# Alignment metrics
rcv_results = batch_rcv(gene_files)
pi_results = batch_parsimony_informative(gene_files)
gap_results = batch_gap_percentage(gene_files)

# Combined metrics
treeness_rcv_results = batch_treeness_over_rcv(gene_files)

# All return dict: {gene_id: value}
```
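
A batch helper of this kind can be sketched as a loop that applies one metric function per gene and records failures instead of silently dropping them. The name `batch_compute` and its `key` parameter are hypothetical; the skill's own `batch_*` functions may behave differently on errors.

```python
def batch_compute(gene_files, metric, key="tree_file"):
    """Apply metric to entry[key] per gene; return ({gene_id: value}, failures)."""
    results, failures = {}, {}
    for entry in gene_files:
        if key not in entry:
            continue  # e.g. a gene that has an alignment but no tree
        try:
            results[entry["gene_id"]] = metric(entry[key])
        except Exception as exc:  # keep going, but remember what failed
            failures[entry["gene_id"]] = str(exc)
    return results, failures


# Toy demonstration: len() stands in for a real metric function.
toy_genes = [
    {"gene_id": "g1", "tree_file": "abc"},
    {"gene_id": "g2"},
    {"gene_id": "g3", "tree_file": None},
]
values, errors = batch_compute(toy_genes, len)
print(values)
print(sorted(errors))
```

Returning the failures alongside the results makes it easier to confirm that every gene in a group was processed before reporting a median or test statistic.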

Statistical Analysis

```python
# Summary statistics
# (named treeness_stats so it does not shadow the scipy `stats` module)
treeness_stats = summary_stats(list(treeness_results.values()))
# Returns: {'mean': ..., 'median': ..., 'std': ..., 'min': ..., 'max': ...}

# Group comparison
comparison = compare_groups(
    list(fungi_treeness.values()),
    list(animal_treeness.values()),
    group1_name="Fungi",
    group2_name="Animals"
)
# Returns: {'u_statistic': ..., 'p_value': ..., 'Fungi': {...}, 'Animals': {...}}
```


See `references/parsimony_analysis.md` for full workflow.
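
A summary helper returning those keys might look like the standard-library sketch below. It is a hypothetical stand-in for the skill's `summary_stats()`. One assumption matters for matching expected answers: `statistics.stdev` is the sample standard deviation (n - 1 denominator), whereas `numpy.std` defaults to the population form, so the two can disagree.

```python
import statistics


def summary_stats(values):
    """Return mean, median, std, min, and max for a non-empty list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Assumption: sample standard deviation; undefined for a single value.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


summary = summary_stats([1, 2, 3, 4])
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```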

---

Answer Extraction for BixBench

| Question Pattern | Extraction Method |
|------------------|-------------------|
| "What is the median X?" | `np.median(values)` |
| "What is the maximum X?" | `np.max(values)` |
| "What is the difference between median X for A vs B?" | `abs(np.median(a) - np.median(b))` |
| "What percentage of X have Y above Z?" | `sum(v > Z for v in values) / len(values) * 100` |
| "What is the Mann-Whitney U statistic?" | `stats.mannwhitneyu(a, b)[0]` |
| "What is the p-value?" | `stats.mannwhitneyu(a, b)[1]` |
| "What is the X value for gene Y?" | `results[gene_id]` |
| "What is the fold-change in median X?" | `np.median(a) / np.median(b)` |
| "multiplied by 1000" | `round(value * 1000)` |
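
The extraction rules can be tried on toy numbers. The sketch below uses the standard-library `statistics` module in place of NumPy so it runs without dependencies; the values and the 0.5 threshold are invented for illustration only.

```python
import statistics

a = [0.42, 0.55, 0.61, 0.70]  # toy metric values for group A
b = [0.30, 0.35, 0.50]        # toy metric values for group B

# "difference between median X for A vs B"
median_diff = abs(statistics.median(a) - statistics.median(b))

# "fold-change in median X"
fold_change = statistics.median(a) / statistics.median(b)

# "what percentage of X have Y above Z" with Z = 0.5
pct_above = sum(v > 0.5 for v in a) / len(a) * 100

# "multiplied by 1000"
scaled = round(statistics.median(b) * 1000)

print(f"median difference: {median_diff:.4f}")
print(f"fold-change: {fold_change:.4f}")
print(f"percent above 0.5: {pct_above:.1f}")
print(f"median of B x 1000: {scaled}")
```

The fold-change depends on which group is the numerator, so the order of `a` and `b` should follow the wording of the question.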

Rounding Rules

  • PhyKIT default: 4 decimal places
  • Percentages: Match question format (e.g., "35%" → integer, "3.5%" → 1 decimal)
  • P-values: Scientific notation for very small values
  • U statistics: Integer (no decimals)
  • Always check question wording: "rounded to 3 decimal places" overrides defaults
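
These rules can be collected into one small formatting helper. The sketch below is a hypothetical illustration: the function name, the `kind` labels, and the cutoff at which a p-value switches to scientific notation are all assumptions rather than part of the skill.

```python
def format_answer(value, kind="metric", decimals=4):
    """Format a number following the rounding rules listed above."""
    if kind == "u_statistic":
        return f"{value:.0f}"
    if kind == "p_value":
        # Assumption: switch to scientific notation once fixed-point would show zeros.
        if value < 10 ** -decimals:
            return f"{value:.4e}"
        return f"{value:.{decimals}f}"
    if kind == "percentage":
        return f"{value:.{decimals}f}%"
    return f"{value:.{decimals}f}"


print(format_answer(0.123456))                        # 0.1235
print(format_answer(8.0, kind="u_statistic"))         # 8
print(format_answer(3.2e-7, kind="p_value"))          # 3.2000e-07
print(format_answer(35.0, "percentage", decimals=0))  # 35%
```

Passing `decimals` explicitly is how a question's own wording, such as "rounded to 3 decimal places", would override the default.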

BixBench Question Coverage

| Project | Questions | Metrics |
|---------|-----------|---------|
| bix-4 | 7 | DVMC analysis (fungi vs animals) |
| bix-11 | 6 | Treeness analysis (median, percentages, Mann-Whitney U) |
| bix-12 | 5 | Parsimony informative sites (counts, percentages, ratios) |
| bix-25 | 2 | Treeness/RCV with gap filtering |
| bix-35 | 4 | Evolutionary rate (specific genes, comparisons) |
| bix-38 | 5 | Tree length (fold-change, variance, paired ratios) |
| bix-45 | 4 | RCV (Mann-Whitney U, medians, paired differences) |
| bix-60 | 1 | Average treeness across multiple trees |

ToolUniverse Integration

Sequence Retrieval

```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Get sequences from NCBI
result = tu.tools.NCBI_get_sequence(accession="NP_000546")

# Get gene tree from Ensembl
tree_result = tu.tools.EnsemblCompara_get_gene_tree(gene="ENSG00000141510")

# Get species tree from OpenTree
tree_result = tu.tools.OpenTree_get_induced_subtree(ott_ids="770315,770319")
```

---

File Structure

tooluniverse-phylogenetics/
├── SKILL.md                           # This file (workflow orchestration)
├── QUICK_START.md                     # Quick reference
├── test_phylogenetics.py             # Comprehensive test suite
├── references/
│   ├── sequence_alignment.md         # Alignment analysis details
│   ├── tree_building.md              # Tree construction methods
│   ├── parsimony_analysis.md         # Statistical comparison workflows
│   └── troubleshooting.md            # Common issues and solutions
└── scripts/
    ├── format_alignment.py           # Alignment format conversion
    └── tree_statistics.py            # Core metric implementations

Completeness Checklist

Before returning your answer, verify:
  • Identified all input files (alignments and/or trees)
  • Detected group structure (fungi/animals/etc.) if applicable
  • Used correct PhyKIT function for the requested metric
  • Processed ALL genes in each group (not just a sample)
  • Applied correct statistical test if comparison requested
  • Used correct rounding (4 decimals default, or as specified)
  • Returned the specific statistic asked for (median, max, U stat, p-value, etc.)
  • For percentage questions, confirmed whether answer is integer or decimal
  • For "difference" questions, confirmed direction (A - B vs abs difference)
  • For Mann-Whitney U, used `alternative='two-sided'` (default in scipy)

Next Steps

  • For detailed alignment analysis workflows → See `references/sequence_alignment.md`
  • For tree construction methods → See `references/tree_building.md`
  • For statistical comparison examples → See `references/parsimony_analysis.md`
  • For common errors and solutions → See `references/troubleshooting.md`
  • For script implementations → See `scripts/tree_statistics.py`

Support

For issues with:

License

Same as ToolUniverse framework license.