bioinformatics-visualization

Bioinformatics Visualization


iTOL Dataset Formats and Troubleshooting

Choosing the Right Dataset Type

DATASET_BINARY (Recommended for markers/symbols):
  • More reliable than DATASET_SYMBOL
  • All species must be listed with binary values (0 or 1)
  • Simpler format, better iTOL compatibility
  • Use for: presence/absence markers, technology indicators, categorical highlights
Format example:
DATASET_BINARY
SEPARATOR TAB

DATASET_LABEL	CLR Technology
COLOR	#ff0000

LEGEND_TITLE	Sequencing Technology
LEGEND_SHAPES	2
LEGEND_COLORS	#ff0000
LEGEND_LABELS	CLR (PacBio)

FIELD_SHAPES	2
FIELD_COLORS	#ff0000
FIELD_LABELS	CLR

DATA
Species_name_1	1
Species_name_2	0
Species_name_3	1
DATASET_SYMBOL (Less reliable):
  • Can be finicky about format
  • Per-species shape, size, and color specifications are complex
  • May not display correctly even with valid format
  • Avoid unless BINARY doesn't meet needs
DATASET_COLORSTRIP (Good for gradients):
  • Reliable for color gradients (e.g., temporal data, continuous values)
  • Only species with data need to be listed
  • Good for non-binary categorical or continuous data
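
As a minimal sketch of producing the DATASET_BINARY layout from the format example, the snippet below assumes the data is a plain Python dict mapping species names to presence flags. The function name `build_binary_dataset` and its parameters are hypothetical helpers for illustration, not part of iTOL; the header fields simply mirror the example.

```python
def build_binary_dataset(presence, label="CLR Technology", color="#ff0000",
                         legend_title="Sequencing Technology",
                         legend_label="CLR (PacBio)", field_label="CLR"):
    """Return DATASET_BINARY text for a {species: bool} mapping (hypothetical helper)."""
    header = [
        "DATASET_BINARY",
        "SEPARATOR TAB",
        "",
        f"DATASET_LABEL\t{label}",
        f"COLOR\t{color}",
        "",
        f"LEGEND_TITLE\t{legend_title}",
        "LEGEND_SHAPES\t2",
        f"LEGEND_COLORS\t{color}",
        f"LEGEND_LABELS\t{legend_label}",
        "",
        "FIELD_SHAPES\t2",
        f"FIELD_COLORS\t{color}",
        f"FIELD_LABELS\t{field_label}",
        "",
        "DATA",
    ]
    # Every species gets a row: 1 if the marker is present, 0 otherwise.
    rows = [f"{name}\t{1 if present else 0}" for name, present in presence.items()]
    return "\n".join(header + rows) + "\n"


text = build_binary_dataset(
    {"Species_name_1": True, "Species_name_2": False, "Species_name_3": True}
)
```

Writing `text` to a file (for example with `open(path, "w")`) gives a config that can be uploaded alongside the tree. Listing absent species explicitly with `0` keeps the file consistent with the "all species must be listed" rule.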

Common iTOL Errors and Fixes

Error: "Unknown variable 'SYMBOL_SHAPE'"
  • Cause: Mixing global symbol settings with per-species data
  • Fix: Switch to DATASET_BINARY format
Error: "Invalid color '1' for node X"
  • Cause: DATASET_SYMBOL data format mismatch
  • Fix: Use DATASET_BINARY instead, format:
    species<tab>0_or_1
Symbols not appearing on tree:
  • Likely cause: DATASET_SYMBOL format issues
  • Fix: Convert to DATASET_BINARY
  • Verify: Check that all species in config exist in tree file
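
Several of these errors come down to DATA rows that do not follow the `species<tab>0_or_1` shape. The check below is a rough sketch, assuming a tab-separated, single-field binary dataset like the one in the format example; `check_binary_rows` is a hypothetical helper, not an iTOL tool.

```python
def check_binary_rows(dataset_text):
    """Return (line_number, line) pairs for DATA rows that are not 'species<TAB>0_or_1'."""
    problems = []
    in_data = False
    for number, line in enumerate(dataset_text.splitlines(), start=1):
        if not in_data:
            # Skip header lines until the DATA keyword is reached.
            in_data = line.strip() == "DATA"
            continue
        if not line.strip() or line.startswith("#"):
            continue  # ignore blank lines and comments
        fields = line.split("\t")
        if len(fields) != 2 or not fields[0] or fields[1] not in {"0", "1"}:
            problems.append((number, line))
    return problems


sample = (
    "DATASET_BINARY\nSEPARATOR TAB\nDATA\n"
    "Species_name_1\t1\n"
    "Species_name_2 0\n"       # space instead of tab
    "Species_name_3\tred\n"    # color value left over from a SYMBOL-style row
)
problems = check_binary_rows(sample)
```

Running a check like this before uploading catches space-separated rows and leftover SYMBOL-style values, which would otherwise only surface as iTOL error messages.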

Species Name Compatibility

Critical: Species names must match exactly between tree and annotation files
Common issues:
  1. Case sensitivity: "Alca Torda" vs "Alca_torda"
  2. Spaces vs underscores: Always use underscores in tree format
  3. Subspecies names: Handle three-part names carefully
Fix for case sensitivity:
```python
# Convert scientific names to tree format with case normalization
# (df is assumed to be a pandas DataFrame with a 'scientific_name' column)
df['species_tree'] = df['scientific_name'].str.replace(' ', '_', regex=False)

# Fix uppercase after underscore (Alca_Torda -> Alca_torda)
df['species_tree'] = df['species_tree'].str.replace(
    r'_([A-Z])', lambda m: '_' + m.group(1).lower(), regex=True
)
```

**Validation pattern**:
```python
# Always validate species compatibility
import re

# Extract species from tree
# (this pattern captures two-part Genus_species names only)
with open('tree.nwk') as f:
    tree_content = f.read()
tree_species = set(re.findall(r'([A-Z][a-z]+_[a-z]+)', tree_content))

# Check config species
config_species = set(df['species_tree'])
missing = config_species - tree_species
if missing:
    print(f"Species in config but not in tree: {missing}")
```
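
Three-part (subspecies) names need slightly different handling from two-part names. The sketch below is one possible approach, assuming the tree uses `Genus_species` or `Genus_species_subspecies` labels with only the genus capitalized; `to_tree_name` and `TREE_NAME_PATTERN` are hypothetical names, and the sample names are purely illustrative.

```python
import re

# Matches Genus_species and Genus_species_subspecies (assumed label convention).
TREE_NAME_PATTERN = re.compile(r"[A-Z][a-z]+(?:_[a-z]+){1,2}")


def to_tree_name(scientific_name):
    """Normalize a scientific name to underscore-separated tree format."""
    parts = scientific_name.replace("_", " ").split()
    if not parts:
        return ""
    genus = parts[0].capitalize()               # 'alca' or 'ALCA' -> 'Alca'
    epithets = [p.lower() for p in parts[1:]]   # species and optional subspecies
    return "_".join([genus] + epithets)


two_part = to_tree_name("Alca Torda")
three_part = to_tree_name("alca torda Islandica")
found = TREE_NAME_PATTERN.findall("(Alca_torda_islandica:0.1,Fratercula_arctica:0.2);")
```

Applying the same pattern to both the tree file and the annotation names keeps the set comparison consistent, so a subspecies label is not silently truncated to its first two parts.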

Color Gradients for Temporal Data

Effective color schemes:
Temporal progression (old → new):
  • Light Yellow → Dark Red (ColorBrewer YlOrRd)
  • Clearly shows progression from past to present
  • Example: #ffffcc (2019) → #b10026 (2025)
Avoid:
  • Blue → Yellow → Red (confusing middle point)
  • Diverging palettes for sequential data
ColorBrewer palettes for sequential data:
  • YlOrRd: Yellow-Orange-Red (temporal, intensity)
  • YlGn: Yellow-Green (growth, vegetation)
  • PuBuGn: Purple-Blue-Green (water, depth)
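
To illustrate the light yellow to dark red progression, the sketch below maps a year onto a color by linear RGB interpolation between the two endpoint colors from the example, then emits DATASET_COLORSTRIP rows. This is a simplification: it does not reproduce the intermediate stops of the full YlOrRd palette. The function names and the "Assembly year" label are assumptions made for illustration.

```python
def hex_to_rgb(color):
    """'#ffffcc' -> (255, 255, 204)."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def year_color(year, first=2019, last=2025, start="#ffffcc", end="#b10026"):
    """Linearly interpolate between start and end; years outside the range are clamped."""
    t = min(1.0, max(0.0, (year - first) / (last - first)))
    channels = [
        round(a + (b - a) * t)
        for a, b in zip(hex_to_rgb(start), hex_to_rgb(end))
    ]
    return "#" + "".join(f"{c:02x}" for c in channels)


def build_colorstrip(years_by_species, label="Assembly year"):
    """Return DATASET_COLORSTRIP text; each DATA row is species, color, label."""
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "COLOR\t#b10026",
        "DATA",
    ]
    # Only species that have a year need a row.
    for species, year in years_by_species.items():
        lines.append(f"{species}\t{year_color(year)}\t{year}")
    return "\n".join(lines) + "\n"


strip = build_colorstrip({"Species_name_1": 2019, "Species_name_3": 2025})
```

Because the mapping is monotonic in both lightness and hue, older entries stay visibly lighter than newer ones, which is the property that makes a sequential palette readable for temporal data.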

Debugging Workflow

  1. Generate config file
  2. Upload to iTOL (https://itol.embl.de)
  3. If errors: Save error messages to file
  4. Check format: BINARY vs SYMBOL vs COLORSTRIP
  5. Validate species names: Match against tree file
  6. Test with minimal dataset: 5-10 species first
  7. Switch formats if needed: SYMBOL → BINARY usually works
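
The "test with a minimal dataset" step can be sketched as a small helper that keeps the header and only the first few DATA rows. This assumes the dataset text contains a line that is exactly `DATA`; `minimal_dataset` is a hypothetical helper name.

```python
def minimal_dataset(dataset_text, max_rows=5):
    """Keep the header and only the first max_rows non-blank DATA rows."""
    lines = dataset_text.splitlines()
    if "DATA" not in lines:
        raise ValueError("no DATA line found in dataset")
    data_index = lines.index("DATA")
    header = lines[: data_index + 1]
    rows = [line for line in lines[data_index + 1:] if line.strip()]
    return "\n".join(header + rows[:max_rows]) + "\n"


full = "DATASET_BINARY\nSEPARATOR TAB\nDATA\n" + "".join(
    f"Species_name_{i}\t{i % 2}\n" for i in range(1, 21)
)
small = minimal_dataset(full, max_rows=5)
```

If the reduced file displays correctly and the full one does not, the problem is likely in specific rows (for example, a species name missing from the tree) rather than in the header.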


Related Skills

  • data-visualization: General visualization best practices
  • bioinformatics/fundamentals: Core bioinformatics concepts
  • bioinformatics/phylogenetics: Phylogenetic analysis workflows