bioinformatics-fundamentals

Bioinformatics Fundamentals

Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.

When to Use This Skill

  • Working with sequencing data (PacBio HiFi, Hi-C, Illumina)
  • Debugging SAM/BAM alignment or filtering issues
  • Processing AGP files for genome assembly curation
  • Validating AGP coordinate systems and unloc assignments
  • Understanding paired-end vs single-end data
  • Interpreting quality metrics (MAPQ, PHRED scores)
  • Troubleshooting empty outputs or broken read pairs
  • General bioinformatics data analysis

SAM/BAM Format Essentials

SAM Flags (Bitwise)

Flags are additive - a read can have multiple flags set simultaneously.
Common Flags:
  • `0x0001` (1): Read is paired in sequencing
  • `0x0002` (2): Each segment properly aligned (proper pair)
  • `0x0004` (4): Read unmapped
  • `0x0008` (8): Mate unmapped
  • `0x0010` (16): Read mapped to reverse strand
  • `0x0020` (32): Mate mapped to reverse strand
  • `0x0040` (64): First in pair (R1/forward)
  • `0x0080` (128): Second in pair (R2/reverse)
  • `0x0100` (256): Secondary alignment
  • `0x0400` (1024): PCR or optical duplicate
  • `0x0800` (2048): Supplementary alignment
Flag Combinations:
  • Properly paired R1: 99 (0x63 = 1 + 2 + 32 + 64)
  • Properly paired R2: 147 (0x93 = 1 + 2 + 16 + 128)
  • Unmapped read: 4
  • Mate unmapped: 8
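Because the flags are additive, any flag integer can be decoded with simple bitwise tests. A minimal Python sketch (the short names are informal labels for the bits in the table above, not identifiers from the SAM spec):

```python
# Map each SAM flag bit to an informal name (bits as listed in the table above).
SAM_FLAGS = {
    0x0001: "paired",
    0x0002: "proper_pair",
    0x0004: "unmapped",
    0x0008: "mate_unmapped",
    0x0010: "reverse",
    0x0020: "mate_reverse",
    0x0040: "first_in_pair",
    0x0080: "second_in_pair",
    0x0100: "secondary",
    0x0400: "duplicate",
    0x0800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all flag bits set on a read."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(decode_flag(147))  # ['paired', 'proper_pair', 'reverse', 'second_in_pair']
```

This reproduces the flag combinations listed above: 99 is a properly paired R1 and 147 a properly paired R2.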

Proper Pair Flag (0x0002)

What "proper pair" means:
  • Both R1 and R2 are mapped
  • Mapping orientations are correct (typically R1 forward, R2 reverse)
  • Insert size is reasonable for the library
  • Pair conforms to aligner's expectations
Important: Different aligners have different criteria for proper pairs!

MAPQ (Mapping Quality)

Formula:
MAPQ = -10 * log10(P(mapping is wrong))
Common Thresholds:
  • MAPQ >= 60: High confidence (error probability < 0.0001%)
  • MAPQ >= 30: Good quality (error probability < 0.1%)
  • MAPQ >= 20: Acceptable (error probability < 1%)
  • MAPQ >= 10: Low confidence (error probability < 10%)
  • MAPQ = 0: Multi-mapper or unmapped
Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.
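Inverting the formula gives the error probability for any threshold; a quick Python check of the table above:

```python
def mapq_to_error_prob(mapq):
    """Invert MAPQ = -10 * log10(P) to recover P, the probability the mapping is wrong."""
    return 10 ** (-mapq / 10)

# Reproduce the threshold table: MAPQ 60/30/20/10 -> 0.0001% / 0.1% / 1% / 10%
for q in (60, 30, 20, 10):
    print(f"MAPQ {q}: error probability {mapq_to_error_prob(q):.6%}")
```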

CIGAR String

Represents alignment between read and reference:
  • `M`: Match or mismatch (alignment match)
  • `I`: Insertion in read vs reference
  • `D`: Deletion in read vs reference
  • `S`: Soft clipping (bases in read not aligned)
  • `H`: Hard clipping (bases not in read sequence)
  • `N`: Skipped region (for RNA-seq splicing)
Examples:
  • `100M` = perfect 100bp match
  • `50M5I45M` = 50bp match, 5bp insertion, 45bp match
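CIGAR strings parse cleanly with a regex; a minimal sketch (the character class also admits the `=`, `X`, and `P` operators defined in the SAM spec, beyond those listed above):

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def read_length(cigar):
    """Read-sequence length implied by a CIGAR (M, I, S, =, X consume read bases)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

print(parse_cigar("50M5I45M"))  # [(50, 'M'), (5, 'I'), (45, 'M')]
print(read_length("50M5I45M"))  # 100
```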

Sequencing Technologies

PacBio HiFi (High Fidelity)

Characteristics:
  • Long reads: 10-25 kb typical
  • High accuracy: >99.9% (Q20+)
  • Circular Consensus Sequencing (CCS)
  • Single-end data (though from circular molecules)
  • Excellent for de novo assembly
Best Mappers:
  • minimap2 presets: `map-pb`, `map-hifi`
  • BWA-MEM2 can work but is optimized for short reads
Typical Use Cases:
  • De novo genome assembly
  • Structural variant detection
  • Isoform sequencing (Iso-Seq)
  • Haplotype phasing

Hi-C (Chromatin Conformation Capture)

Characteristics:
  • Paired-end short reads (typically 100-150 bp)
  • Read pairs capture chromatin interactions
  • R1 and R2 often map to different scaffolds/chromosomes
  • Requires careful proper pair handling
  • Used for scaffolding and 3D genome structure
Best Mappers:
  • BWA-MEM2 (paired-end mode)
  • BWA-MEM (paired-end mode)
Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!
Typical Use Cases:
  • Genome scaffolding (connecting contigs)
  • 3D chromatin structure analysis
  • Haplotype phasing
  • Assembly quality assessment

Illumina Short Reads

Characteristics:
  • Short reads: 50-300 bp
  • Paired-end or single-end
  • High throughput
  • Well-established quality scores
Best Mappers:
  • BWA-MEM2, BWA-MEM (general purpose)
  • Bowtie2 (fast, local alignment)
  • STAR (RNA-seq spliced alignment)

Common Tools and Their Behaviors

samtools view

Purpose: Filter, convert, and view SAM/BAM files
Key Flags:
  • `-b`: Output BAM format
  • `-h`: Include header
  • `-f INT`: Require flags (keep reads WITH these flags)
  • `-F INT`: Filter flags (remove reads WITH these flags)
  • `-q INT`: Minimum MAPQ threshold
  • `-L FILE`: Keep reads overlapping regions in BED file
Important Behavior:
  • `-L` (region filtering) checks each read individually, not pairs
  • Can break read pairs if mates map to different regions
  • Flag filters (`-f`, `-F`) are applied before region filters (`-L`)
Example - Filter for proper pairs:
```bash
samtools view -b -f 2 input.bam > proper_pairs.bam
```
Example - Filter by region (may break pairs):
```bash
samtools view -b -L regions.bed input.bam > filtered.bam
```
Example - Proper pairs in regions (correct order):
```bash
samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam
```

bamtools filter

Purpose: Advanced filtering with complex criteria
Key Features:
  • Can filter on multiple properties simultaneously
  • More strict about pair validation than samtools
  • Supports JSON filter rules
Common Filters:
  • `isPaired: true` - Read is from paired-end sequencing
  • `isProperPair: true` - Read is part of proper pair
  • `isMapped: true` - Read is mapped
  • `mapQuality: >=30` - Mapping quality threshold
Important Difference from samtools:
  • `isProperPair` is more strict than samtools `-f 2`
  • Checks pair validity more thoroughly
  • Better for ensuring R1/R2 match correctly

samtools fastx

Purpose: Convert SAM/BAM to FASTQ/FASTA
Output Modes:
  • `outputs: ["r1", "r2"]` - Separate forward and reverse for paired-end
  • `outputs: ["other"]` - Single output for single-end data
  • `outputs: ["r0"]` - All reads (mixed paired/unpaired)
Filtering Options:
  • `inclusive_filter: ["2"]` - Require proper pair flag
  • `exclusive_filter: ["4", "8"]` - Exclude unmapped or mate unmapped
  • `exclusive_filter_all: ["8"]` - Exclude if mate unmapped
Critical: Use appropriate filters to ensure R1/R2 files match!

Common Patterns and Best Practices

Pattern 1: Filtering Paired-End Data by Regions

**WRONG WAY (breaks pairs):**
```bash
# Region filter first → breaks pairs when mates are in different regions
samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair
# Result: Empty output (all pairs broken)
```

**RIGHT WAY (preserves pairs):**
```bash
# Proper pair filter FIRST, then region filter
samtools view -b -f 2 -L regions.bed input.bam > output.bam
# Result: Pairs where both mates are in regions (or one mate in region, other anywhere)
```

**BEST WAY (both mates in regions):**
```bash
# Filter for proper pairs, then use paired-aware region filtering:
# pipe to a custom script that keeps pairs where both mates are in regions
samtools view -b -f 2 input.bam | ...
```

Pattern 2: Extracting FASTQ from Filtered BAM

**For Paired-End:**
```bash
# Ensure proper pairs before extraction
# --i1-flags 2: require proper pair; --i2-flags 64,128: separate R1/R2
samtools fastx -1 R1.fq.gz -2 R2.fq.gz --i1-flags 2 --i2-flags 64,128 input.bam
```

**For Single-End:**
```bash
# Simple extraction
samtools fastx -0 output.fq.gz input.bam
```

Pattern 3: Quality Filtering

**Conservative (high quality):**
```bash
# MAPQ >= 30, proper pairs, no secondary/supplementary (256 + 2048 = 2304)
samtools view -b -q 30 -f 2 -F 2304 input.bam
```

**Permissive (for low-coverage data):**
```bash
# MAPQ >= 10, mapped reads only
samtools view -b -q 10 -F 4 input.bam
```

Common Issues and Solutions

Issue 1: Empty Output After Region Filtering (Hi-C Data)

Symptom:
  • BAM file non-empty before filtering
  • Empty after region filtering + proper pair filtering
  • Happens with paired-end data (especially Hi-C)
Cause:
  • Region filter (`samtools view -L`) breaks read pairs
  • One mate in region, other mate outside region
  • Proper pair flag (0x2) is lost
  • Subsequent `isProperPair` filter removes all reads
Solution:
```bash
# Apply proper pair filter BEFORE region filtering
samtools view -b -f 2 -L regions.bed input.bam > output.bam
```
**See Also:** `common-issues.md` for detailed troubleshooting

Issue 2: R1 and R2 Files Have Different Read Counts

Symptom:
  • Forward and reverse FASTQ files have different numbers of reads
  • Downstream tools fail expecting matched pairs
Cause:
  • Improper filtering broke some pairs
  • One mate filtered out, other kept
  • Extraction didn't require proper pairing
Solution:
```bash
# Require proper pairs during extraction
samtools fastx -1 R1.fq -2 R2.fq --i1-flags 2 input.bam
```

Issue 3: Low Mapping Rate for Hi-C Data

Symptom:
  • Many Hi-C reads unmapped or low MAPQ
  • Expected for Hi-C due to chimeric reads
Not Actually a Problem:
  • Hi-C involves ligation of distant DNA fragments
  • Creates chimeric molecules
  • Mappers may mark these as low quality or unmapped
  • This is normal for Hi-C data
Solution:
  • Use Hi-C-specific pipelines (e.g., HiC-Pro, Juicer)
  • Don't filter too aggressively on MAPQ
  • Accept lower mapping rates than DNA-seq

Issue 4: Proper Pairs Lost After Mapping

Symptom:
  • Few reads marked as proper pairs (flag 0x2)
  • Expected paired-end data
Possible Causes:
  1. Insert size distribution wrong (check aligner parameters)
  2. Reference mismatch (reads from different assembly)
  3. Poor library quality
  4. Incorrect orientation flags passed to aligner
Solution:
```bash
# Check insert size distribution
samtools stats input.bam | grep "insert size"

# Check pairing flags
samtools flagstat input.bam
```

Quality Metrics

N50 and Related Metrics

N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer
How to interpret:
  • Higher N50 = better contiguity
  • Compare to expected chromosome/scaffold sizes
  • Use with caution - can be misleading for fragmented assemblies
Related Metrics:
  • L50: Number of contigs needed to reach N50
  • N90: More stringent than N50 (90% coverage)
  • NG50: N50 relative to genome size (better for comparisons)
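The definition can be computed directly from a list of contig lengths; a minimal sketch returning both N50 and L50:

```python
def n50_l50(lengths):
    """Return (N50, L50): walk contigs longest-first until 50% of the total is covered."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count

# Total 280; 100 + 80 = 180 >= 140 (half the total), so N50 = 80 and L50 = 2
print(n50_l50([100, 80, 50, 30, 20]))  # (80, 2)
```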

Coverage and Depth

Coverage: Percentage of reference bases covered by at least one read
Depth: Average number of reads covering each base
Recommended Depths:
  • Genome assembly (HiFi): 30-50x
  • Variant calling: 30x minimum
  • RNA-seq: 20-40 million reads
  • Hi-C scaffolding: 50-100x genomic coverage
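Average depth follows directly from the definition (total sequenced bases divided by genome size); a small illustrative calculation with made-up numbers:

```python
def expected_depth(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

# e.g. 2 million HiFi reads of ~15 kb over a 1 Gb genome gives ~30x
print(expected_depth(2_000_000, 15_000, 1_000_000_000))  # 30.0
```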

File Format Quick Reference

FASTA

```
>sequence_id description
ATCGATCGATCG
ATCGATCG
```
  • Header line starts with `>`
  • Sequences can span multiple lines
  • No quality scores

FASTQ

```
@read_id
ATCGATCGATCG
+
IIIIIIIIIIII
```
  • Four lines per read
  • Quality scores (Phred+33 encoding typical)
  • Can be gzipped (.fastq.gz)
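With Phred+33 encoding, each quality character decodes to `ord(char) - 33`, so the `IIII...` line above is all Q40; a minimal sketch:

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer Q scores."""
    return [ord(c) - 33 for c in quality_string]

def base_error_prob(q):
    """Per-base error probability implied by a Phred score."""
    return 10 ** (-q / 10)

print(phred33_scores("IIII"))  # [40, 40, 40, 40]
print(base_error_prob(40))     # 0.0001 (1 error in 10,000 bases)
```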

BED

```
chr1    1000    2000    feature_name    score    +
```
  • 0-based, half-open coordinates [start, end)
  • Used for regions, features, intervals
  • Minimum 3 columns (chrom, start, end)

AGP

```
chr1    1       5000    1    W    contig_1    1    5000    +
chr1    5001    5100    2    U    100    scaffold    yes    proximity_ligation
```
  • Tab-delimited genome assembly format
  • 1-based closed coordinates [start, end]
  • Describes construction of objects from components
  • Object and component lengths must match
  • See AGP Format section for complete specification
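The rule that object and component lengths must match can be checked mechanically for `W` lines (a minimal sketch; both coordinate pairs are 1-based and closed, so a span is end - start + 1):

```python
def agp_spans_match(line):
    """For a W component line, the object span must equal the component span."""
    cols = line.rstrip("\n").split("\t")
    if cols[4] != "W":  # column 5 is the component type; skip gap lines (N/U)
        return True
    object_span = int(cols[2]) - int(cols[1]) + 1
    component_span = int(cols[7]) - int(cols[6]) + 1
    return object_span == component_span

line = "chr1\t1\t5000\t1\tW\tcontig_1\t1\t5000\t+"
print(agp_spans_match(line))  # True
```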

Best Practices

General

  1. Always check data type: Paired-end vs single-end determines filtering strategy
  2. Understand your sequencing technology: Hi-C behaves differently than HiFi
  3. Filter in the right order: Proper pairs BEFORE region filtering
  4. Validate outputs: Check file sizes, read counts, flagstat
  5. Use appropriate MAPQ thresholds: Too stringent = lost data, too permissive = noise

For Hi-C Data

  1. Expect distant read pairs: Don't be surprised by different scaffolds
  2. Preserve proper pairs: Critical for downstream scaffolding
  3. Use paired-aware tools: Standard filters may break pairs
  4. Don't over-filter on MAPQ: Hi-C often has lower MAPQ than DNA-seq

For HiFi Data

  1. Single-end processing: No pair concerns
  2. High quality expected: Can use strict filters
  3. Use appropriate presets: minimap2 `map-hifi` or `map-pb`
  4. Consider read length distribution: HiFi reads vary in length

For Tool Testing

  1. Create self-contained datasets: Both mates in selected region
  2. Maintain proper pairs: Essential for realistic testing
  3. Use representative data: Subsample proportionally, not randomly
  4. Verify file sizes: Too small = overly filtered

Genome Ark Data Retrieval

Overview

Genome Ark (s3://genomeark/) is a public AWS S3 bucket containing VGP assemblies and QC data. No credentials are required; access it with `--no-sign-request`.

Directory Structure

```
s3://genomeark/species/{SPECIES_NAME}/{TOLID}/
├── assembly_vgp_HiC_2.0/          # Standard HiC assembly
├── assembly_vgp_standard_2.1/     # Standard assembly
├── assembly_vgp_trio_1.0/         # Trio assembly
├── assembly_curated/              # Manually curated
├── assembly_{METHOD}_{VERSION}/   # Method-specific (e.g., assembly_verkko_1.4)
└── genomic_data/                  # Raw sequencing data
```

GenomeScope Data Locations

GenomeScope summaries can be in multiple locations - search comprehensively:
Common paths (priority order):
  1. {assembly}/evaluation/genomescope/{TOLID}_genomescope__Summary.txt
  2. {assembly}/evaluation/genomescope2/{TOLID}_genomescope__Summary.txt
  3. {assembly}/qc/genomescope/{TOLID}_genomescope__Summary.txt
  4. {assembly}/intermediates/genomescope/{TOLID}_genomescope__Summary.txt
Search strategy:
  1. Dynamically discover assembly folders (not all species use same structure)
  2. Search multiple subfolder locations per assembly
  3. Try standard assemblies first (HiC_2.0, standard_2.1)
  4. Fall back to method-specific assemblies

Python Implementation Pattern

```python
import subprocess

def search_genomescope_recursive(species_name, tolid):
    """Search for GenomeScope data across all assembly folders."""
    species_s3 = species_name.replace(' ', '_')

    # 1. Discover assembly folders
    result = subprocess.run(
        ['aws', 's3', 'ls', f's3://genomeark/species/{species_s3}/{tolid}/',
         '--no-sign-request'],
        capture_output=True, text=True
    )

    assembly_folders = []
    for line in result.stdout.strip().split('\n'):
        if 'PRE' in line:
            folder = line.split('PRE')[1].strip().rstrip('/')
            if folder.startswith('assembly'):
                assembly_folders.append(folder)

    # 2. Search patterns
    patterns = [
        'evaluation/genomescope/{tolid}_genomescope__Summary.txt',
        'evaluation/genomescope2/{tolid}_genomescope__Summary.txt',
        'qc/genomescope/{tolid}_genomescope__Summary.txt',
        'intermediates/genomescope/{tolid}_genomescope__Summary.txt',
    ]

    # 3. Try each combination
    for assembly in assembly_folders:
        for pattern in patterns:
            s3_path = (f's3://genomeark/species/{species_s3}/{tolid}/'
                       f'{assembly}/{pattern.format(tolid=tolid)}')
            try:
                result = subprocess.run(
                    ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request'],
                    capture_output=True, text=True, timeout=10
                )
            except subprocess.TimeoutExpired:
                continue  # skip slow or unreachable paths
            if result.returncode == 0 and result.stdout:
                return result.stdout

    return None
```

GenomeScope Summary Parsing

```python
import re

def parse_genomescope_summary(content):
    """Extract genome characteristics from GenomeScope2 summary."""
    data = {}

    # Genome Haploid Length (max value - second column)
    match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['genome_size_genomescope'] = int(match.group(1).replace(',', ''))

    # Heterozygosity percentage (max value)
    match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content)
    if match:
        data['heterozygosity_percent'] = float(match.group(1))

    # Genome Unique Length (max value)
    match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['unique_length'] = int(match.group(1).replace(',', ''))

    # Genome Repeat Length (max value)
    match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['repeat_length'] = int(match.group(1).replace(',', ''))

    # Calculate repeat content percentage
    if 'repeat_length' in data and 'unique_length' in data:
        total = data['repeat_length'] + data['unique_length']
        if total > 0:
            data['repeat_content_percent'] = (data['repeat_length'] / total) * 100

    return data
```

Best Practices

  1. Rate limiting: Add 0.2s delay between successful fetches to be respectful to AWS
  2. Caching: Check if file exists locally before fetching
  3. Timeout: Use 10-15s timeout per request
  4. Assembly discovery: Always discover assemblies dynamically - don't assume structure
  5. Multiple locations: GenomeScope data may be in evaluation/, qc/, or intermediates/
  6. Expected coverage: ~15-20% of VGP assemblies have GenomeScope data available

Common Issues

**Species not found**: Some species use different naming (check exact spacing/underscores)
```bash
# Search for species
aws s3 ls s3://genomeark/species/ --no-sign-request | grep -i "species_name"
```

**Assembly folder variations**: Not all use standard names
```bash
# List all assemblies for a species
aws s3 ls s3://genomeark/species/{SPECIES}/{TOLID}/ --no-sign-request
```

AWS CLI Setup

```bash
# Install AWS CLI (no credentials needed for Genome Ark)
conda install -c conda-forge awscli

# Test access
aws s3 ls s3://genomeark/species/ --no-sign-request | head
```

Karyotype Data Curation and Literature Search

Overview

Karyotype data (diploid 2n and haploid n chromosome numbers) is critical for genome assembly validation but rarely available via APIs. Manual literature curation is required.

Search Strategy

Effective Search Terms

"{species_name} karyotype chromosome 2n"
"{species_name} diploid number karyotype"
"{genus} karyotype evolution"
"cytogenetic analysis {family_name}"
"{species_name} chromosome number diploid"

Best Reference Sources

  1. PubMed/PMC: Primary cytogenetic studies
  2. ResearchGate: Karyotype descriptions and figures
  3. Specialized databases:
  4. Genome assembly papers: Often mention expected karyotype
  5. Comparative cytogenetic studies: Family-level analyses

Search Time Estimates

  • Model organisms, domestic species: 2-3 minutes
  • Well-studied taxonomic groups: 5-10 minutes
  • Rare/uncommon species: 10-20 minutes or not found

Taxonomic Conservation Patterns

Mammals

  • Cetaceans: Highly conserved 2n = 44, n = 22 (exceptions: pygmy sperm whale, right whale, beaked whales = 2n = 42)
  • Felidae: Conserved 2n = 38, n = 19
  • Canidae: Conserved 2n = 78, n = 39
  • Primates: Variable (great apes 2n = 48, macaques 2n = 42, marmosets 2n = 46)

Birds

  • Anatidae (waterfowl): Highly conserved 2n = 80, n = 40 across ducks, geese, swans
  • Galliformes (game birds): Typically 2n = 78, n = 39 (chicken, quail, grouse)
  • Passerines: Variable 2n = 78-82, most common 2n = 80
  • Ancestral avian karyotype: Putative 2n = 80
  • General pattern: 50.7% of birds have 2n = 78-82; 21.7% have exactly 2n = 80

Reptiles

  • Lacertidae (wall lizards): Often 2n = 38, n = 19

Genome Assembly Interpretation

⚠️ Warning: Chromosome-level assemblies often report fewer chromosomes than actual diploid number
Why: Assemblies typically capture only:
  • Macrochromosomes (large chromosomes)
  • Larger microchromosomes
  • Small microchromosomes remain unassembled
Example: Waterfowl with 2n = 80 often have genome assemblies with 34-42 "chromosomes"
  • True karyotype: 10 macro pairs + 30 micro pairs = 80
  • Assembly: ~34-42 scaffolds (only macro + larger micros)

Using Conservation for Inference

When specific karyotype data is unavailable but genus/family patterns are strong:
  1. High confidence inference (acceptable for publication):
    • Multiple congeneric species confirmed
    • Family-level conservation documented
    • No known exceptions in genus
  2. Document inference clearly:
    ```csv
    accession,taxid,species,2n,n,notes,reference
    GCA_XXX,123,Species name,80,40,Inferred from Anatidae conservation,https://family-level-study.url
    ```
  3. Priority for direct confirmation:
    • Species with conservation exceptions
    • Type specimens or reference species
    • Phylogenetically divergent lineages

VGP-Specific: Sex Chromosome Adjustment

When both sex chromosomes are in main haplotype (common in VGP assemblies):
  • Expected scaffolds = n + 1 (not n)
  • Reason: X+Y or Z+W = two distinct chromosomes
  • Check: VGP metadata column "Sex chromosomes main haplotype"
  • Patterns: "Has X and Y", "Has Z and W", "Has X1, X2, and Y"
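This adjustment can be sketched as a small helper (an illustration only: `expected_scaffolds` is a hypothetical function, and it assumes the metadata strings look exactly like the patterns listed above):

```python
def expected_scaffolds(haploid_n: int, sex_chr_field: str) -> int:
    """Expected chromosome-level scaffolds in the main haplotype.

    sex_chr_field is the VGP metadata value, e.g. "Has X and Y".
    n already counts one sex chromosome; each additional distinct
    sex chromosome adds one scaffold.
    """
    # Count distinct sex chromosome names, e.g. "X1, X2, and Y" -> 3
    names = sex_chr_field.replace("Has", "").replace("and", ",").split(",")
    distinct = [s.strip() for s in names if s.strip()]
    return haploid_n + max(len(distinct) - 1, 0)
```

For a waterfowl main haplotype carrying both Z and W, this gives n + 1 = 41 expected scaffolds, matching the rule above.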

Data Recording Format

CSV Structure:
```csv
accession,taxid,species_name,diploid_2n,haploid_n,notes,reference
GCA_XXXXXX,12345,Species name,80,40,Brief description,https://doi.org/...
```
Notes field examples:
  • "Standard {family} karyotype"
  • "Conserved {genus} karyotype"
  • "Inferred from {family} conservation"
  • "Unusual karyotype for family"
  • "Geographic variation reported"
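Appending a record in this structure might look like the following sketch (the filename `karyotypes.csv` and the record values are hypothetical):

```python
import csv

# Hypothetical record following the CSV structure above
record = {
    "accession": "GCA_XXXXXX", "taxid": "12345",
    "species_name": "Species name", "diploid_2n": "80", "haploid_n": "40",
    "notes": "Inferred from Anatidae conservation",
    "reference": "https://doi.org/...",
}

fieldnames = ["accession", "taxid", "species_name", "diploid_2n",
              "haploid_n", "notes", "reference"]

with open("karyotypes.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if f.tell() == 0:   # write the header only for a fresh file
        writer.writeheader()
    writer.writerow(record)
```

Using `csv.DictWriter` keeps the column order fixed and quotes any notes field that happens to contain a comma.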

Prioritization for Literature Searches

TIER 1 (>90% success rate):
  • Model organisms (zebrafish, mouse, medaka)
  • Domestic species (chicken, goat, sheep)
  • Game animals (waterfowl, deer)
  • Laboratory species (fruit fly, nematode)
TIER 2 (70-90% success rate):
  • Well-studied taxonomic groups (Podarcis lizards, corvids)
  • Conservation focus species (raptors, large mammals)
  • Commercial species (salmonids, oysters)
TIER 3 (50-70% success rate):
  • Common but not economically important
  • Widespread distribution
  • Recent phylogenetic interest
Low priority (<50% success rate):
  • Deep-sea species
  • Rare/endangered without conservation genetics
  • Recently described species
  • Cryptic species complexes

AGP Format (A Golden Path)

Overview

AGP (A Golden Path) format describes how assembled sequences (chromosomes, scaffolds) are constructed from component sequences (contigs, scaffolds) and gaps. Critical for genome assembly curation and submission to NCBI/EBI.

When to Use This Knowledge

  • Processing genome assemblies for submission to databases
  • Curating chromosome-level assemblies
  • Splitting haplotype assemblies
  • Assigning unlocalized scaffolds (unlocs)
  • Debugging AGP validation errors
  • Converting between assembly representations

AGP Format Structure

Tab-delimited format with 9 columns; column 5 distinguishes sequence lines (W) from gap lines (N/U).
Sequence Lines (Column 5 = 'W'):
object  obj_beg  obj_end  part_num  W  component_id  comp_beg  comp_end  orientation
Gap Lines (Column 5 = 'N' or 'U'):
object  obj_beg  obj_end  part_num  N/U  gap_length  gap_type  linkage  linkage_evidence
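A minimal parser for these two line types might look like this (a sketch; `parse_agp_line` is a hypothetical helper, not part of any AGP tooling):

```python
def parse_agp_line(line: str) -> dict:
    """Parse one AGP line into named fields.

    Column 5 decides the record type: 'W' for sequence lines,
    'N'/'U' for gap lines.
    """
    f = line.rstrip("\n").split("\t")
    rec = {"object": f[0], "obj_beg": int(f[1]),
           "obj_end": int(f[2]), "part_num": int(f[3])}
    if f[4] == "W":
        rec.update(component_id=f[5], comp_beg=int(f[6]),
                   comp_end=int(f[7]), orientation=f[8])
    else:  # 'N' or 'U' gap line
        rec.update(gap_length=int(f[5]), gap_type=f[6],
                   linkage=f[7], linkage_evidence=f[8])
    return rec
```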

Critical Coordinate Rules

Rule 1: Object and Component Lengths MUST Match
For sequence lines, the span in the object MUST equal the span in the component:
(obj_end - obj_beg + 1) == (comp_end - comp_beg + 1)
Example - CORRECT:
Scaffold_47_unloc_1  1  54360  1  W  scaffold_23.hap1  19274039  19328398  -

Object length: 54360 - 1 + 1 = 54,360 bp
Component length: 19328398 - 19274039 + 1 = 54,360 bp ✓

Example - INCORRECT:
Scaffold_47_unloc_1  1  19328398  1  W  scaffold_23.hap1  19274039  19328398  -

Object length: 19328398 - 1 + 1 = 19,328,398 bp
Component length: 19328398 - 19274039 + 1 = 54,360 bp ✗
ERROR: Lengths don't match!

Rule 2: Component Numbering Restarts for New Objects
Each object (column 1) has its own component numbering (column 4) starting at 1:
Scaffold_10          1  30578279  1  W  scaffold_4.hap2   1  30578279  -
Scaffold_10_unloc_1  1  65764     1  W  scaffold_74.hap2  1  65764     +  # ← Starts at 1, not 3!

Rule 3: Sequential Component Numbering Within Objects
Component numbers increment sequentially (gaps and sequences both count):
Scaffold_2  1        1731008  1  W  scaffold_25.hap1  1  1731008  -
Scaffold_2  1731009  1731108  2  U  100  scaffold  yes  proximity_ligation
Scaffold_2  1731109  1956041  3  W  scaffold_70.hap1  1  224933   -

Common AGP Processing Issues

Issue 1: Incorrect Object Coordinates When Creating Unlocs

Symptom:
ERROR: object coordinates (1, 19328398) and component coordinates (19274039, 19328398)
do not have the same length
Cause: When converting a region of a scaffold into an unlocalized sequence (unloc), the object coordinates must represent the length of the extracted region, not the original component end coordinate.
Wrong Approach:
```python
# Setting object end to component end coordinate
agp_df.loc[index, 'chr_end'] = agp_df.loc[index, 'scaff_end']  # ✗ WRONG
```
Correct Approach:
```python
# Calculate actual length from component coordinates
agp_df.loc[index, 'chr_end'] = int(agp_df.loc[index, 'scaff_end']) - int(agp_df.loc[index, 'scaff_start']) + 1  # ✓ CORRECT
```

Issue 2: Component Numbering Not Reset for New Objects

Symptom: Unloc scaffolds have component numbers > 1 when they should start at 1.
Cause: When creating a new object (unloc scaffold), component numbering wasn't reset.
Solution:
```python
# When creating unlocs, reset component number
agp_df.loc[index, '#_scaffs'] = 1  # Column 4: component number
```

Issue 3: AGPcorrect Accumulating Coordinates

Symptom: Unloc sequences inherit cumulative coordinates from parent scaffolds.
Cause: AGPcorrect adjusts coordinates based on sequence length corrections. When scaffolds are later split into unlocs, the accumulated corrections need to be recalculated based on actual component spans.
Solution: Always recalculate object coordinates from component spans when creating new objects (unlocs).

AGP Processing Best Practices

1. Coordinate System Understanding

  • Object coordinates (columns 2-3): Position within the assembled object (1-based, inclusive)
  • Component coordinates (columns 7-8): Position within the source sequence (1-based, inclusive)
  • Both use 1-based closed intervals [start, end]
  • Length calculation: end - start + 1

2. Creating Unlocalized Sequences (Unlocs)

```python
# When extracting a region to create an unloc:

# 1. Calculate the actual length of the region
length = int(comp_end) - int(comp_start) + 1

# 2. Set object coordinates for the new unloc
obj_start = 1     # Always starts at 1
obj_end = length  # Equals the length

# 3. Reset component number
component_num = 1  # New object, new numbering

# 4. Rename the object
new_object_name = f"{parent_scaffold}_unloc_{unloc_number}"
```

3. Validating AGP Files

Use NCBI's AGP validator:
```bash
agp_validate assembly.agp
```
Common validation checks:
  • Object/component length match
  • Sequential component numbering
  • No coordinate overlaps
  • Gap specifications valid
  • Orientation values (+, -, ?, 0, na)

4. Handling Haplotype-Split Assemblies

When splitting diploid assemblies into haplotypes:
  1. Identify haplotype markers in sequence names (H1/hap1, H2/hap2)
  2. Maintain proper pairing information
  3. Process unlocs separately per haplotype
  4. Remove haplotig duplications
  5. Track gaps appropriately (especially proximity ligation gaps)
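Steps 1-3 could be sketched as follows (assuming haplotype markers appear in component names as in the earlier examples, e.g. `scaffold_4.hap2`; both helpers are hypothetical):

```python
import re
from typing import Optional

def haplotype_of(component_id: str) -> Optional[str]:
    """Infer haplotype from names like 'scaffold_4.hap2' or 'H1_scaffold_3'."""
    m = re.search(r'(?:hap|H)(\d)', component_id)
    return f"hap{m.group(1)}" if m else None

def split_by_haplotype(agp_lines):
    """Partition AGP lines by haplotype; gap lines follow their object's haplotype."""
    out = {"hap1": [], "hap2": []}
    current = None
    for line in agp_lines:
        fields = line.split("\t")
        if len(fields) >= 9 and fields[4] == "W":
            current = haplotype_of(fields[5]) or current
        if current:
            out[current].append(line)
    return out
```

Assigning gap lines to the haplotype of the preceding sequence line keeps proximity ligation gaps attached to the scaffold they join.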

AGP Coordinate Debugging Pattern

When encountering coordinate errors:
```python
# For each AGP line, verify:
obj_length = int(obj_end) - int(obj_beg) + 1
comp_length = int(comp_end) - int(comp_beg) + 1
assert obj_length == comp_length, f"Length mismatch: obj={obj_length}, comp={comp_length}"

# For sequential component numbers:
assert comp_num == expected_num, f"Component number gap: got {comp_num}, expected {expected_num}"
```
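These checks can be combined into a single pass over a file (a sketch only; use NCBI's agp_validate for authoritative validation):

```python
def check_agp(lines):
    """Yield (line_number, message) for basic AGP consistency errors."""
    expected = {}  # object -> next expected component number
    for i, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and comment lines
        f = line.rstrip("\n").split("\t")
        obj, obj_beg, obj_end, part_num = f[0], int(f[1]), int(f[2]), int(f[3])
        # Rule 2/3: numbering restarts per object and increments sequentially
        want = expected.get(obj, 1)
        if part_num != want:
            yield i, f"{obj}: component number {part_num}, expected {want}"
        expected[obj] = part_num + 1
        # Rule 1: object and component spans must match on sequence lines
        if f[4] == "W":
            obj_len = obj_beg and (obj_end - obj_beg + 1)
            comp_len = int(f[7]) - int(f[6]) + 1
            if obj_len != comp_len:
                yield i, f"{obj}: length mismatch obj={obj_len}, comp={comp_len}"
```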

AGP File Structure by Assembly Stage

1. Raw Assembly AGP:
  • Direct representation from assembler
  • May have incorrect sequence lengths
  • Needs coordinate correction (AGPcorrect)
2. Corrected AGP:
  • Sequence lengths match actual FASTA
  • Coordinates adjusted for length discrepancies
  • Ready for haplotype splitting
3. Haplotype-Split AGP:
  • Separate files per haplotype
  • Unlocs identified but not separated
  • Haplotigs marked but not removed
4. Final Curated AGP:
  • Unlocs separated into individual objects
  • Haplotigs removed to separate file
  • Proximity ligation gaps cleaned
  • Ready for database submission

BED File Processing and Telomere Analysis

Pattern: Classifying Scaffolds by Telomere Types

When analyzing telomere data from BED files to classify scaffolds:
File Structure:
  • Terminal telomeres BED: columns include scaffold, start, end, orientation (p/q), accession
  • Interstitial telomeres BED: similar structure with position markers (p/q/u for internal)
Best Practice - Use Python CSV Module:
```python
import csv
from collections import defaultdict

# Use defaultdict for automatic initialization
telomere_counts = defaultdict(lambda: {'terminal': 0, 'interstitial': 0})

# Process with csv.reader (more portable than pandas)
with open('telomeres.bed', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scaffold = row[0]
        accession = row[10]  # GCA accession
        key = (accession, scaffold)
        telomere_counts[key]['terminal'] += 1
```
Why CSV over pandas:
- No external dependencies (pandas may not be installed)
- Faster for simple tabular operations
- Lower memory footprint for large files
- Better portability across environments

Classification Categories:
1. Category 1: 2 terminal telomeres, 0 interstitial (complete chromosomes)
2. Category 2: 1 terminal telomere, 0 interstitial (partial)
3. Category 3: Has interstitial telomeres (likely assembly issues)
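The three categories can then be applied to each entry of the `telomere_counts` mapping built above (`classify` is a small hypothetical helper):

```python
def classify(counts: dict) -> str:
    """Classify a scaffold by its telomere counts (categories as above)."""
    if counts['interstitial'] > 0:
        return 'category_3'   # interstitial telomeres: likely assembly issues
    if counts['terminal'] == 2:
        return 'category_1'   # complete chromosome
    if counts['terminal'] == 1:
        return 'category_2'   # partial
    return 'uncategorized'    # no telomere signal detected
```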

NCBI Data Integration Strategies

Check Existing Data Sources Before API Calls

Problem: Need chromosome counts for 400+ assemblies from NCBI.
Anti-pattern: Query NCBI datasets API for each accession
```python
import subprocess

# DON'T: Query 400+ times
for accession in missing_data:
    result = subprocess.run(['datasets', 'summary', 'genome', 'accession', accession])
    # Takes 10+ minutes, hits API rate limits
```
Better Pattern: Check if data already exists in compiled tables
```python
import csv

# DO: Look for existing compiled data first
# VGP table has multiple chromosome count columns:
# - num_chromosomes (column 54)
# - total_number_of_chromosomes (column 106)
# - num_chromosomes_haploid (column 122)

# Read from existing comprehensive table
with open('VGP-table.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        num_chr = row[53] if row[53] else row[105]  # Fallback strategy
```
Results: Filled 392/417 missing values instantly vs 10+ minutes of API calls.
Fallback Strategy for Multiple Columns:
```python
# Try multiple sources in order of preference
num_chromosomes = row[53] if (len(row) > 53 and row[53]) else ''
if not num_chromosomes and len(row) > 105:
    num_chromosomes = row[105]  # Alternative column
```
When to use NCBI API:
- Data not in existing tables
- Need real-time/latest data
- Fetching assembly reports or sequence data
- Small number of queries (<20)

API Best Practices (when necessary):
- Use full path to datasets command (may be aliased)
- Add delays between calls (`time.sleep(0.5)`)
- Set reasonable timeouts
- Handle errors gracefully
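Put together, those practices might look like this sketch (it assumes the NCBI `datasets` CLI may or may not be installed; `fetch_summary` is a hypothetical wrapper, and the accession is a placeholder):

```python
import shutil
import subprocess
import time
from typing import Optional

def fetch_summary(accession: str, binary: Optional[str] = None) -> Optional[str]:
    """Fetch one genome summary via the datasets CLI; None on any failure."""
    binary = binary or shutil.which("datasets")  # full path; avoids shell aliases
    if binary is None:
        return None                              # CLI not installed
    try:
        result = subprocess.run(
            [binary, "summary", "genome", "accession", accession],
            capture_output=True, text=True, timeout=60,  # reasonable timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None                              # handle errors gracefully
    return result.stdout if result.returncode == 0 else None

if __name__ == "__main__":
    for accession in ["GCA_XXXXXX.1"]:           # hypothetical accession
        print(fetch_summary(accession))
        time.sleep(0.5)                          # delay between calls
```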

Related Skills

  • vgp-pipeline - VGP workflows process Hi-C and HiFi data
  • galaxy-tool-wrapping - Galaxy tools work with SAM/BAM and sequencing data formats
  • galaxy-workflow-development - Workflows process sequencing data

Supporting Documentation

  • reference.md: Detailed format specifications and tool documentation
  • common-issues.md: Comprehensive troubleshooting guide with examples

Version History

  • v1.1.1: Added BED file processing patterns for telomere analysis and NCBI data integration strategies
  • v1.1.0: Added comprehensive AGP format documentation including coordinate validation, unloc processing, and common error patterns
  • v1.0.0: Initial release with SAM/BAM, Hi-C, HiFi, common filtering patterns