bioinformatics-visualization
Bioinformatics Visualization
iTOL Dataset Formats and Troubleshooting
Choosing the Right Dataset Type
DATASET_BINARY (Recommended for markers/symbols):
- More reliable than DATASET_SYMBOL
- All species must be listed with binary values (0 or 1)
- Simpler format, better iTOL compatibility
- Use for: presence/absence markers, technology indicators, categorical highlights
Format example:
DATASET_BINARY
SEPARATOR TAB
DATASET_LABEL CLR Technology
COLOR #ff0000
LEGEND_TITLE Sequencing Technology
LEGEND_SHAPES 2
LEGEND_COLORS #ff0000
LEGEND_LABELS CLR (PacBio)
FIELD_SHAPES 2
FIELD_COLORS #ff0000
FIELD_LABELS CLR
DATA
Species_name_1 1
Species_name_2 0
Species_name_3 1
DATASET_SYMBOL (Less reliable):
- Can be finicky about format
- Per-species shape/size/color specifications are complex
- May not display correctly even when the format looks valid
- Avoid unless DATASET_BINARY doesn't meet your needs
DATASET_COLORSTRIP (Good for gradients):
- Reliable for color gradients (e.g., temporal data, continuous values)
- Only species with data need to be listed
- Good for non-binary categorical or continuous data
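Because the format example declares SEPARATOR TAB, every header and data line has to use a real tab character, which is easy to lose when copying text. A minimal sketch of generating such a file programmatically follows; the function name `build_binary_dataset` and its parameters are hypothetical, and the header fields simply mirror the example above rather than the full iTOL template.

```python
def build_binary_dataset(presence, dataset_label="CLR Technology",
                         field_label="CLR", color="#ff0000"):
    """Return DATASET_BINARY text for a {species: 0 or 1} mapping (illustrative helper)."""
    header = [
        "DATASET_BINARY",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{dataset_label}",
        f"COLOR\t{color}",
        "FIELD_SHAPES\t2",
        f"FIELD_COLORS\t{color}",
        f"FIELD_LABELS\t{field_label}",
        "DATA",
    ]
    rows = []
    for species, value in presence.items():
        # Reject anything other than 0/1 before it reaches iTOL.
        if value not in (0, 1):
            raise ValueError(f"{species}: expected 0 or 1, got {value!r}")
        rows.append(f"{species}\t{int(value)}")
    return "\n".join(header + rows) + "\n"


presence = {"Species_name_1": 1, "Species_name_2": 0, "Species_name_3": 1}
text = build_binary_dataset(presence)
print(text)
```

Writing `text` to a file (for example with `open(..., "w")`) gives a dataset that can be dragged onto the tree in iTOL.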
Common iTOL Errors and Fixes
Error: "Unknown variable 'SYMBOL_SHAPE'"
- Cause: Mixing global symbol settings with per-species data
- Fix: Switch to DATASET_BINARY format
Error: "Invalid color '1' for node X"
- Cause: DATASET_SYMBOL data format mismatch
- Fix: Use DATASET_BINARY instead, format:
species<tab>0_or_1
Symbols not appearing on tree:
- Likely cause: DATASET_SYMBOL format issues
- Fix: Convert to DATASET_BINARY
- Verify: Check that all species in config exist in tree file
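Several of these errors come down to data rows that are not in the `species<tab>0_or_1` shape. As a rough pre-upload check, the sketch below scans the DATA section of a single-field DATASET_BINARY file and reports rows that would likely be rejected. The function name `check_binary_rows` is hypothetical, and the check assumes tab separation and one value per species.

```python
def check_binary_rows(text):
    """Return a list of problems found in the DATA section (illustrative helper)."""
    problems = []
    in_data = False
    for number, line in enumerate(text.splitlines(), start=1):
        if not in_data:
            # Everything before the DATA keyword is header; skip it.
            in_data = line.strip() == "DATA"
            continue
        if not line.strip() or line.startswith("#"):
            continue  # ignore blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                f"line {number}: expected 2 tab-separated fields, found {len(fields)}"
            )
        elif fields[1].strip() not in {"0", "1"}:
            problems.append(f"line {number}: value {fields[1]!r} is not 0 or 1")
    return problems


sample = (
    "DATASET_BINARY\nSEPARATOR TAB\nDATA\n"
    "Species_name_1\t1\n"
    "Species_name_2 0\n"      # space instead of tab
    "Species_name_3\tred\n"   # not a binary value
)
problems = check_binary_rows(sample)
for problem in problems:
    print(problem)
```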
Species Name Compatibility
Critical: Species names must match exactly between tree and annotation files
Common issues:
- Case sensitivity: "Alca Torda" vs "Alca_torda"
- Spaces vs underscores: Always use underscores in tree format
- Subspecies names: Three-part names (Genus_species_subspecies) need the same normalization, and a two-part pattern will not capture them in full
Fix for case sensitivity:
```python
# Assumes df is a pandas DataFrame with a 'scientific_name' column.
# Convert scientific names to tree format with case normalization
df['species_tree'] = df['scientific_name'].str.replace(' ', '_')
# Fix uppercase after underscore (Alca_Torda -> Alca_torda)
df['species_tree'] = df['species_tree'].str.replace(
    r'_([A-Z])',
    lambda m: '_' + m.group(1).lower(),
    regex=True
)
```
**Validation pattern**:
```python
# Always validate species compatibility
import re

# Extract species from tree (this pattern matches two-part Genus_species names)
with open('tree.nwk') as f:
    tree_content = f.read()
tree_species = set(re.findall(r'([A-Z][a-z]+_[a-z]+)', tree_content))

# Check config species
config_species = set(df['species_tree'])
missing = config_species - tree_species
if missing:
    print(f"Species in config but not in tree: {missing}")
```
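The two-part pattern `[A-Z][a-z]+_[a-z]+` used for validation truncates three-part subspecies names. One possible alternative, sketched here under the assumption that the Newick file uses unquoted labels, is to collect whatever token directly follows `(` or `,`, since those positions hold leaf labels. The helper name `newick_leaf_names` and the small example tree are made up for illustration.

```python
import re


def newick_leaf_names(newick):
    """Return the set of leaf labels in an unquoted Newick string (illustrative helper)."""
    # A leaf label directly follows "(" or ","; internal node labels follow ")".
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


# Made-up example tree with one three-part (subspecies) name.
newick = "((Alca_torda:0.1,Uria_aalge:0.2)0.95:0.05,Falco_peregrinus_anatum:0.3);"
tree_species = newick_leaf_names(newick)

config_species = {"Alca_torda", "Falco_peregrinus_anatum", "Alca_Torda"}
missing = config_species - tree_species
print(sorted(missing))  # ['Alca_Torda']
```

Quoted labels or labels containing spaces would need a proper Newick parser instead of this regular expression.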
Color Gradients for Temporal Data
Effective color schemes:
Temporal progression (old → new):
- Light Yellow → Dark Red (ColorBrewer YlOrRd)
- Clearly shows progression from past to present
- Example: #ffffcc (2019) → #b10026 (2025)
Avoid:
- Blue → Yellow → Red (confusing middle point)
- Diverging palettes for sequential data
ColorBrewer palettes for sequential data:
- YlOrRd: Yellow-Orange-Red (temporal, intensity)
- YlGn: Yellow-Green (growth, vegetation)
- PuBuGn: Purple-Blue-Green (water, depth)
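To show how a year could be mapped onto the light-yellow-to-dark-red range, here is a minimal sketch that blends linearly between the two endpoint colors and emits DATASET_COLORSTRIP rows. Linear RGB blending is a simplification and does not reproduce the intermediate ColorBrewer YlOrRd classes. The helper names, the dataset label, and the species-to-year mapping are all hypothetical.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def interpolate_hex(start, end, fraction):
    """Linearly blend two hex colors; fraction 0 gives start, 1 gives end."""
    a, b = hex_to_rgb(start), hex_to_rgb(end)
    mixed = (round(x + (y - x) * fraction) for x, y in zip(a, b))
    return "#" + "".join(f"{channel:02x}" for channel in mixed)


def year_color(year, first=2019, last=2025, start="#ffffcc", end="#b10026"):
    """Map a year onto the start-to-end gradient, clamping out-of-range years."""
    fraction = (year - first) / (last - first)
    return interpolate_hex(start, end, min(max(fraction, 0.0), 1.0))


# Hypothetical species-to-year mapping.
species_years = {"Species_name_1": 2019, "Species_name_2": 2022, "Species_name_3": 2025}

header = [
    "DATASET_COLORSTRIP",
    "SEPARATOR TAB",
    "DATASET_LABEL\tYear",
    "COLOR\t#b10026",
    "DATA",
]
rows = [f"{name}\t{year_color(year)}\t{year}" for name, year in species_years.items()]
colorstrip_text = "\n".join(header + rows) + "\n"
print(colorstrip_text)
```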
Debugging Workflow
- Generate config file
- Upload to iTOL (https://itol.embl.de)
- If errors: Save error messages to file
- Check format: BINARY vs SYMBOL vs COLORSTRIP
- Validate species names: Match against tree file
- Test with minimal dataset: 5-10 species first
- Switch formats if needed: SYMBOL → BINARY usually works
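The "test with a minimal dataset" step can be scripted. The sketch below keeps the header of a dataset file and only the first few data rows, so a failing upload can be narrowed down quickly. The function name `minimal_dataset` and the `max_rows` parameter are hypothetical, and it assumes the file contains a line reading exactly DATA.

```python
def minimal_dataset(text, max_rows=5):
    """Keep the header and only the first max_rows data rows (illustrative helper)."""
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if "DATA" not in stripped:
        raise ValueError("no DATA line found in dataset text")
    data_index = stripped.index("DATA")
    header = lines[: data_index + 1]
    # Skip blank lines so max_rows counts real data rows only.
    rows = [line for line in lines[data_index + 1:] if line.strip()]
    return "\n".join(header + rows[:max_rows]) + "\n"


full = "DATASET_BINARY\nSEPARATOR TAB\nDATA\n" + "".join(
    f"Species_name_{i}\t{i % 2}\n" for i in range(1, 9)
)
small = minimal_dataset(full, max_rows=5)
print(small)
```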
Related Skills
- data-visualization: General visualization best practices
- bioinformatics/fundamentals: Core bioinformatics concepts
- bioinformatics/phylogenetics: Phylogenetic analysis workflows