Bioinformatics Visualization

iTOL Dataset Formats and Troubleshooting

Choosing the Right Dataset Type

DATASET_BINARY (Recommended for markers/symbols):

More reliable than DATASET_SYMBOL
All species must be listed with binary values (0 or 1)
Simpler format, better iTOL compatibility
Use for: presence/absence markers, technology indicators, categorical highlights

Format example:

DATASET_BINARY
SEPARATOR TAB

DATASET_LABEL	CLR Technology
COLOR	#ff0000

LEGEND_TITLE	Sequencing Technology
LEGEND_SHAPES	2
LEGEND_COLORS	#ff0000
LEGEND_LABELS	CLR (PacBio)

FIELD_SHAPES	2
FIELD_COLORS	#ff0000
FIELD_LABELS	CLR

DATA
Species_name_1	1
Species_name_2	0
Species_name_3	1
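A file in this layout can be generated rather than typed by hand. The following is a minimal sketch, assuming the presence/absence data is available as a plain dict; the function name `build_binary_dataset` and its parameters are hypothetical, and the legend lines from the example are omitted for brevity.

```python
def build_binary_dataset(presence, label, color="#ff0000", shape=2):
    """Return DATASET_BINARY text for a {species_name: bool} mapping.

    Hypothetical helper: it emits the same keywords as the example above,
    using tabs as the separator.
    """
    header = [
        "DATASET_BINARY",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        f"COLOR\t{color}",
        f"FIELD_SHAPES\t{shape}",
        f"FIELD_COLORS\t{color}",
        f"FIELD_LABELS\t{label}",
        "DATA",
    ]
    # One row per species: name, a tab, then 1 (present) or 0 (absent)
    rows = [
        f"{name}\t{1 if present else 0}"
        for name, present in sorted(presence.items())
    ]
    return "\n".join(header + rows) + "\n"


text = build_binary_dataset(
    {"Species_name_1": True, "Species_name_2": False, "Species_name_3": True},
    label="CLR",
)
print(text)
```

Writing `text` to a file (for example with `open(path, "w")`) gives something that can be uploaded alongside the tree.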

DATASET_SYMBOL (Less reliable):

Sensitive to small formatting deviations
Per-species shape, size, and color specifications are complex
May not display correctly even with valid format
Avoid unless DATASET_BINARY doesn't meet your needs

DATASET_COLORSTRIP (Good for gradients):

Reliable for color gradients (e.g., temporal data, continuous values)
Only species with data need to be listed
Good for non-binary categorical or continuous data
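For comparison, a color strip file lists one color (and optionally a label) per species. The sketch below assumes tab-separated rows of the form `name<tab>color<tab>label`; the helper name `build_colorstrip_dataset` and the sample species names are hypothetical.

```python
def build_colorstrip_dataset(colors, label, dataset_color="#000000"):
    """Return DATASET_COLORSTRIP text for a {species: (hex_color, label)} mapping.

    Hypothetical helper: only species present in the mapping are written,
    matching the note that species without data can be left out.
    """
    header = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        f"COLOR\t{dataset_color}",
        "DATA",
    ]
    rows = [
        f"{name}\t{hex_color}\t{row_label}"
        for name, (hex_color, row_label) in sorted(colors.items())
    ]
    return "\n".join(header + rows) + "\n"


strip_text = build_colorstrip_dataset(
    {
        "Species_name_1": ("#ffffcc", "2019"),
        "Species_name_3": ("#b10026", "2025"),
    },
    label="Assembly year",
)
print(strip_text)
```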

Common iTOL Errors and Fixes

Error: "Unknown variable 'SYMBOL_SHAPE'"

Cause: Mixing global symbol settings with per-species data
Fix: Switch to DATASET_BINARY format

Error: "Invalid color '1' for node X"

Cause: DATASET_SYMBOL data format mismatch
Fix: Use DATASET_BINARY instead, format:
```
species<tab>0_or_1
```

Symbols not appearing on tree:

Likely cause: DATASET_SYMBOL format issues
Fix: Convert to DATASET_BINARY
Verify: Check that all species in config exist in tree file
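Before uploading, the data rows can be checked locally against the `species<tab>0_or_1` shape described above. This is a rough sketch rather than a full parser: the function name `check_binary_rows` is hypothetical, and it assumes a tab separator and that every value after the species name must be `0` or `1`.

```python
def check_binary_rows(config_text):
    """Return (line_number, line) pairs for DATA rows that are not name<TAB>0_or_1."""
    problems = []
    in_data = False
    for number, line in enumerate(config_text.splitlines(), start=1):
        if not in_data:
            # Everything before the DATA keyword is header; skip it
            in_data = line.strip() == "DATA"
            continue
        if not line.strip() or line.startswith("#"):
            continue  # ignore blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2 or any(value not in {"0", "1"} for value in fields[1:]):
            problems.append((number, line))
    return problems


sample = "DATASET_BINARY\nSEPARATOR TAB\nDATA\nA_b\t1\nC_d\tred\nE_f 0\n"
for number, line in check_binary_rows(sample):
    print(f"line {number}: {line!r}")
```

In the sample, the row with a color name instead of 0/1 and the row separated by a space instead of a tab are both reported.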

Species Name Compatibility

Critical: Species names must match exactly between tree and annotation files

Common issues:

Case sensitivity: "Alca Torda" vs "Alca_torda"
Spaces vs underscores: Always use underscores in tree format
Subspecies names: Handle three-part names carefully

Fix for case sensitivity:

```python
# Convert scientific names to tree format with case normalization.
# Assumes df is a pandas DataFrame with a 'scientific_name' column.
df['species_tree'] = df['scientific_name'].str.replace(' ', '_')

# Fix uppercase after underscore (Alca_Torda -> Alca_torda)
df['species_tree'] = df['species_tree'].str.replace(
    r'_([A-Z])', lambda m: '_' + m.group(1).lower(), regex=True
)
```

**Validation pattern**:
```python
# Always validate species compatibility
import re

# Extract species from tree.
# Note: this pattern captures only two-part names (Genus_species),
# so three-part subspecies names are truncated to their first two parts.
with open('tree.nwk') as f:
    tree_content = f.read()
tree_species = set(re.findall(r'([A-Z][a-z]+_[a-z]+)', tree_content))

# Check config species
config_species = set(df['species_tree'])
missing = config_species - tree_species

if missing:
    print(f"Species in config but not in tree: {missing}")
```
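Because the `[A-Z][a-z]+_[a-z]+` pattern captures only two-part names, three-part subspecies names in the config would be reported as missing even when the tree contains them. One possible alternative, sketched below, pulls every leaf label out of the Newick string instead. It assumes unquoted labels without spaces; the helper name `newick_leaf_names` and the sample tree are illustrative only.

```python
import re


def newick_leaf_names(newick):
    """Return the set of leaf labels in a Newick string.

    Leaf labels directly follow "(" or ","; internal node labels follow ")"
    and are therefore not captured. Quoted labels are not handled.
    """
    return set(re.findall(r"[(,]\s*([^\s():,;]+)", newick))


tree = "((Alca_torda:0.1,Alca_torda_islandica:0.2):0.3,Fratercula_arctica:0.4);"
config_species = {"Alca_torda_islandica", "Fratercula_arctica", "Uria_aalge"}

missing = config_species - newick_leaf_names(tree)
print(missing)  # {'Uria_aalge'}
```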

Color Gradients for Temporal Data

Effective color schemes:

Temporal progression (old → new):

Light Yellow → Dark Red (ColorBrewer YlOrRd)
Clearly shows progression from past to present
Example: #ffffcc (2019) → #b10026 (2025)

Avoid:

Blue → Yellow → Red (confusing middle point)
Diverging palettes for sequential data

ColorBrewer palettes for sequential data:

YlOrRd: Yellow-Orange-Red (temporal, intensity)
YlGn: Yellow-Green (growth, vegetation)
PuBuGn: Purple-Blue-Green (water, depth)
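To assign one color per year between the two endpoints, a simple approach is to interpolate between them. The sketch below uses linear interpolation in RGB space, which is an assumption made for illustration: the intermediate colors it produces are not the official ColorBrewer YlOrRd steps. The function names are hypothetical.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def interpolate_hex(start, end, t):
    """Linearly blend two hex colors; t=0 gives start, t=1 gives end."""
    blended = (
        round(a + (b - a) * t)
        for a, b in zip(hex_to_rgb(start), hex_to_rgb(end))
    )
    return "#" + "".join(f"{channel:02x}" for channel in blended)


def year_colors(first_year, last_year, start="#ffffcc", end="#b10026"):
    """Map each year in the range to a color from light (old) to dark (new)."""
    span = last_year - first_year
    return {
        year: interpolate_hex(start, end, (year - first_year) / span if span else 0.0)
        for year in range(first_year, last_year + 1)
    }


colors = year_colors(2019, 2025)
for year, color in colors.items():
    print(year, color)
```

The resulting mapping can feed the color column of a DATASET_COLORSTRIP file.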

Debugging Workflow

Generate config file
Upload to iTOL (https://itol.embl.de)
If errors: Save error messages to file
Check format: BINARY vs SYMBOL vs COLORSTRIP
Validate species names: Match against tree file
Test with minimal dataset: 5-10 species first
Switch formats if needed: SYMBOL → BINARY usually works
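The minimal-dataset step can be automated by trimming an existing config to its first few data rows. This is a small sketch, assuming the config contains a line consisting only of `DATA`; the function name `minimal_config` is hypothetical.

```python
def minimal_config(config_text, n_rows=5):
    """Keep the header and only the first n_rows non-blank rows after DATA."""
    lines = config_text.splitlines()
    data_index = lines.index("DATA")  # raises ValueError if DATA is absent
    header = lines[: data_index + 1]
    rows = [line for line in lines[data_index + 1:] if line.strip()]
    return "\n".join(header + rows[:n_rows]) + "\n"


full = "DATASET_BINARY\nSEPARATOR TAB\nDATA\n" + "".join(
    f"Species_name_{i}\t{i % 2}\n" for i in range(1, 9)
)
small = minimal_config(full, n_rows=5)
print(small)
```

If the trimmed file displays correctly but the full one does not, the problem is likely in one of the remaining rows rather than in the header.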
