bioinformatics-fundamentals

Bioinformatics Fundamentals

Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns.

When to Use This Skill

  • Working with sequencing data (PacBio HiFi, Hi-C, Illumina)
  • Debugging SAM/BAM alignment or filtering issues
  • Processing AGP files for genome assembly curation
  • Validating AGP coordinate systems and unloc assignments
  • Understanding paired-end vs single-end data
  • Interpreting quality metrics (MAPQ, PHRED scores)
  • Troubleshooting empty outputs or broken read pairs
  • General bioinformatics data analysis

SAM/BAM Format Essentials

SAM Flags (Bitwise)

Flags are additive - a read can have multiple flags set simultaneously.
Common Flags:
  • `0x0001` (1): Read is paired in sequencing
  • `0x0002` (2): Each segment properly aligned (proper pair)
  • `0x0004` (4): Read unmapped
  • `0x0008` (8): Mate unmapped
  • `0x0010` (16): Read mapped to reverse strand
  • `0x0020` (32): Mate mapped to reverse strand
  • `0x0040` (64): First in pair (R1/forward)
  • `0x0080` (128): Second in pair (R2/reverse)
  • `0x0100` (256): Secondary alignment
  • `0x0400` (1024): PCR or optical duplicate
  • `0x0800` (2048): Supplementary alignment
Flag Combinations:
  • Properly paired R1: 99 (0x63 = 1 + 2 + 32 + 64)
  • Properly paired R2: 147 (0x93 = 1 + 2 + 16 + 128)
  • Unmapped read: 4
  • Mate unmapped: 8
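Because the flags are additive, any flag integer can be decoded with simple bitwise tests. A minimal Python sketch (the short names are informal labels for the bits in the table above, not identifiers from the SAM spec):

```python
# Map each SAM flag bit to an informal name (bits as listed in the table above).
SAM_FLAGS = {
    0x0001: "paired",
    0x0002: "proper_pair",
    0x0004: "unmapped",
    0x0008: "mate_unmapped",
    0x0010: "reverse",
    0x0020: "mate_reverse",
    0x0040: "first_in_pair",
    0x0080: "second_in_pair",
    0x0100: "secondary",
    0x0400: "duplicate",
    0x0800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all flag bits set on a read."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(decode_flag(147))  # ['paired', 'proper_pair', 'reverse', 'second_in_pair']
```

This reproduces the flag combinations listed above: 99 is a properly paired R1 and 147 a properly paired R2.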

Proper Pair Flag (0x0002)

What "proper pair" means:
  • Both R1 and R2 are mapped
  • Mapping orientations are correct (typically R1 forward, R2 reverse)
  • Insert size is reasonable for the library
  • Pair conforms to aligner's expectations
Important: Different aligners have different criteria for proper pairs!

MAPQ (Mapping Quality)

Formula:
MAPQ = -10 * log10(P(mapping is wrong))
Common Thresholds:
  • MAPQ >= 60: High confidence (error probability < 0.0001%)
  • MAPQ >= 30: Good quality (error probability < 0.1%)
  • MAPQ >= 20: Acceptable (error probability < 1%)
  • MAPQ >= 10: Low confidence (error probability < 10%)
  • MAPQ = 0: Multi-mapper or unmapped
Note: MAPQ=0 can mean either unmapped OR equally good multiple mappings.
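Inverting the formula gives the error probability for any threshold; a quick Python check of the table above:

```python
def mapq_to_error_prob(mapq):
    """Invert MAPQ = -10 * log10(P) to recover P, the probability the mapping is wrong."""
    return 10 ** (-mapq / 10)

# Reproduce the threshold table: MAPQ 60/30/20/10 -> 0.0001% / 0.1% / 1% / 10%
for q in (60, 30, 20, 10):
    print(f"MAPQ {q}: error probability {mapq_to_error_prob(q):.6%}")
```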

CIGAR String

Represents alignment between read and reference:
  • `M`: Match or mismatch (alignment match)
  • `I`: Insertion in read vs reference
  • `D`: Deletion in read vs reference
  • `S`: Soft clipping (bases in read not aligned)
  • `H`: Hard clipping (bases not in read sequence)
  • `N`: Skipped region (for RNA-seq splicing)
Examples:
  • `100M` = perfect 100bp match
  • `50M5I45M` = 50bp match, 5bp insertion, 45bp match
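CIGAR strings parse cleanly with a regex; a minimal sketch (the character class also admits the `=`, `X`, and `P` operators defined in the SAM spec, beyond those listed above):

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def read_length(cigar):
    """Read-sequence length implied by a CIGAR (M, I, S, =, X consume read bases)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

print(parse_cigar("50M5I45M"))  # [(50, 'M'), (5, 'I'), (45, 'M')]
print(read_length("50M5I45M"))  # 100
```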

Sequencing Technologies

PacBio HiFi (High Fidelity)

Characteristics:
  • Long reads: 10-25 kb typical
  • High accuracy: >99.9% (Q20+)
  • Circular Consensus Sequencing (CCS)
  • Single-end data (though from circular molecules)
  • Excellent for de novo assembly
Best Mappers:
  • minimap2 presets: `map-pb`, `map-hifi`
  • BWA-MEM2 can work but is optimized for short reads
Typical Use Cases:
  • De novo genome assembly
  • Structural variant detection
  • Isoform sequencing (Iso-Seq)
  • Haplotype phasing

Hi-C (Chromatin Conformation Capture)

Characteristics:
  • Paired-end short reads (typically 100-150 bp)
  • Read pairs capture chromatin interactions
  • R1 and R2 often map to different scaffolds/chromosomes
  • Requires careful proper pair handling
  • Used for scaffolding and 3D genome structure
Best Mappers:
  • BWA-MEM2 (paired-end mode)
  • BWA-MEM (paired-end mode)
Critical Concept: Hi-C read pairs intentionally map to distant loci. Region filtering can easily break pairs!
Typical Use Cases:
  • Genome scaffolding (connecting contigs)
  • 3D chromatin structure analysis
  • Haplotype phasing
  • Assembly quality assessment

Illumina Short Reads

Characteristics:
  • Short reads: 50-300 bp
  • Paired-end or single-end
  • High throughput
  • Well-established quality scores
Best Mappers:
  • BWA-MEM2, BWA-MEM (general purpose)
  • Bowtie2 (fast, local alignment)
  • STAR (RNA-seq spliced alignment)

Common Tools and Their Behaviors

samtools view

Purpose: Filter, convert, and view SAM/BAM files
Key Flags:
  • `-b`: Output BAM format
  • `-h`: Include header
  • `-f INT`: Require flags (keep reads WITH these flags)
  • `-F INT`: Filter flags (remove reads WITH these flags)
  • `-q INT`: Minimum MAPQ threshold
  • `-L FILE`: Keep reads overlapping regions in BED file
Important Behavior:
  • `-L` (region filtering) checks each read individually, not pairs
  • Can break read pairs if mates map to different regions
  • Flag filters (`-f`, `-F`) are applied before region filters (`-L`)
Example - Filter for proper pairs:
```bash
samtools view -b -f 2 input.bam > proper_pairs.bam
```
Example - Filter by region (may break pairs):
```bash
samtools view -b -L regions.bed input.bam > filtered.bam
```
Example - Proper pairs in regions (correct order):
```bash
samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam
```

bamtools filter

Purpose: Advanced filtering with complex criteria
Key Features:
  • Can filter on multiple properties simultaneously
  • More strict about pair validation than samtools
  • Supports JSON filter rules
Common Filters:
  • `isPaired: true` - Read is from paired-end sequencing
  • `isProperPair: true` - Read is part of proper pair
  • `isMapped: true` - Read is mapped
  • `mapQuality: >=30` - Mapping quality threshold
Important Difference from samtools:
  • `isProperPair` is more strict than samtools `-f 2`
  • Checks pair validity more thoroughly
  • Better for ensuring R1/R2 match correctly

samtools fastx

Purpose: Convert SAM/BAM to FASTQ/FASTA
Output Modes:
  • `outputs: ["r1", "r2"]` - Separate forward and reverse for paired-end
  • `outputs: ["other"]` - Single output for single-end data
  • `outputs: ["r0"]` - All reads (mixed paired/unpaired)
Filtering Options:
  • `inclusive_filter: ["2"]` - Require proper pair flag
  • `exclusive_filter: ["4", "8"]` - Exclude unmapped or mate unmapped
  • `exclusive_filter_all: ["8"]` - Exclude if mate unmapped
Critical: Use appropriate filters to ensure R1/R2 files match!

Common Patterns and Best Practices

Pattern 1: Filtering Paired-End Data by Regions

**WRONG WAY (breaks pairs):**
```bash
# Region filter first → breaks pairs when mates are in different regions
samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair
# Result: Empty output (all pairs broken)
```

**RIGHT WAY (preserves pairs):**
```bash
# Proper pair filter FIRST, then region filter
samtools view -b -f 2 -L regions.bed input.bam > output.bam
# Result: Pairs where both mates are in regions (or one mate in region, other anywhere)
```

**BEST WAY (both mates in regions):**
```bash
# Filter for proper pairs, then use paired-aware region filtering:
# pipe to a custom script that keeps pairs where both mates are in regions
samtools view -b -f 2 input.bam | ...
```

Pattern 2: Extracting FASTQ from Filtered BAM

**For Paired-End:**
```bash
# Ensure proper pairs before extraction
# --i1-flags 2: require proper pair; --i2-flags 64,128: separate R1/R2
samtools fastx -1 R1.fq.gz -2 R2.fq.gz --i1-flags 2 --i2-flags 64,128 input.bam
```

**For Single-End:**
```bash
# Simple extraction
samtools fastx -0 output.fq.gz input.bam
```

Pattern 3: Quality Filtering

**Conservative (high quality):**
```bash
# MAPQ >= 30, proper pairs, no secondary/supplementary (256 + 2048 = 2304)
samtools view -b -q 30 -f 2 -F 2304 input.bam
```

**Permissive (for low-coverage data):**
```bash
# MAPQ >= 10, mapped reads only
samtools view -b -q 10 -F 4 input.bam
```

Common Issues and Solutions

Issue 1: Empty Output After Region Filtering (Hi-C Data)

Symptom:
  • BAM file non-empty before filtering
  • Empty after region filtering + proper pair filtering
  • Happens with paired-end data (especially Hi-C)
Cause:
  • Region filter (`samtools view -L`) breaks read pairs
  • One mate in region, other mate outside region
  • Proper pair flag (0x2) is lost
  • Subsequent `isProperPair` filter removes all reads
Solution:
```bash
# Apply proper pair filter BEFORE region filtering
samtools view -b -f 2 -L regions.bed input.bam > output.bam
```
**See Also:** `common-issues.md` for detailed troubleshooting

Issue 2: R1 and R2 Files Have Different Read Counts

Symptom:
  • Forward and reverse FASTQ files have different numbers of reads
  • Downstream tools fail expecting matched pairs
Cause:
  • Improper filtering broke some pairs
  • One mate filtered out, other kept
  • Extraction didn't require proper pairing
Solution:
```bash
# Require proper pairs during extraction
samtools fastx -1 R1.fq -2 R2.fq --i1-flags 2 input.bam
```

Issue 3: Low Mapping Rate for Hi-C Data

Symptom:
  • Many Hi-C reads unmapped or low MAPQ
  • Expected for Hi-C due to chimeric reads
Not Actually a Problem:
  • Hi-C involves ligation of distant DNA fragments
  • Creates chimeric molecules
  • Mappers may mark these as low quality or unmapped
  • This is normal for Hi-C data
Solution:
  • Use Hi-C-specific pipelines (e.g., HiC-Pro, Juicer)
  • Don't filter too aggressively on MAPQ
  • Accept lower mapping rates than DNA-seq

Issue 4: Proper Pairs Lost After Mapping

Symptom:
  • Few reads marked as proper pairs (flag 0x2)
  • Expected paired-end data
Possible Causes:
  1. Insert size distribution wrong (check aligner parameters)
  2. Reference mismatch (reads from different assembly)
  3. Poor library quality
  4. Incorrect orientation flags passed to aligner
Solution:
```bash
# Check insert size distribution
samtools stats input.bam | grep "insert size"

# Check pairing flags
samtools flagstat input.bam
```

Quality Metrics

N50 and Related Metrics

N50: Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer
How to interpret:
  • Higher N50 = better contiguity
  • Compare to expected chromosome/scaffold sizes
  • Use with caution - can be misleading for fragmented assemblies
Related Metrics:
  • L50: Number of contigs needed to reach N50
  • N90: More stringent than N50 (90% coverage)
  • NG50: N50 relative to genome size (better for comparisons)
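The definition can be computed directly from a list of contig lengths; a minimal sketch returning both N50 and L50:

```python
def n50_l50(lengths):
    """Return (N50, L50): walk contigs longest-first until 50% of the total is covered."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total:
            return length, count

# Total 280; 100 + 80 = 180 >= 140 (half the total), so N50 = 80 and L50 = 2
print(n50_l50([100, 80, 50, 30, 20]))  # (80, 2)
```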

Coverage and Depth

Coverage: Percentage of reference bases covered by at least one read
Depth: Average number of reads covering each base
Recommended Depths:
  • Genome assembly (HiFi): 30-50x
  • Variant calling: 30x minimum
  • RNA-seq: 20-40 million reads
  • Hi-C scaffolding: 50-100x genomic coverage
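Average depth follows directly from the definition (total sequenced bases divided by genome size); a small illustrative calculation with made-up numbers:

```python
def expected_depth(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

# e.g. 2 million HiFi reads of ~15 kb over a 1 Gb genome gives ~30x
print(expected_depth(2_000_000, 15_000, 1_000_000_000))  # 30.0
```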

File Format Quick Reference

FASTA

```
>sequence_id description
ATCGATCGATCG
ATCGATCG
```
  • Header line starts with `>`
  • Sequences can span multiple lines
  • No quality scores

FASTQ

```
@read_id
ATCGATCGATCG
+
IIIIIIIIIIII
```
  • Four lines per read
  • Quality scores (Phred+33 encoding typical)
  • Can be gzipped (.fastq.gz)
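With Phred+33 encoding, each quality character decodes to `ord(char) - 33`, so the `IIII...` line above is all Q40; a minimal sketch:

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer Q scores."""
    return [ord(c) - 33 for c in quality_string]

def base_error_prob(q):
    """Per-base error probability implied by a Phred score."""
    return 10 ** (-q / 10)

print(phred33_scores("IIII"))  # [40, 40, 40, 40]
print(base_error_prob(40))     # 0.0001 (1 error in 10,000 bases)
```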

BED

```
chr1    1000    2000    feature_name    score    +
```
  • 0-based, half-open coordinates [start, end)
  • Used for regions, features, intervals
  • Minimum 3 columns (chrom, start, end)

AGP

```
chr1    1       5000    1    W    contig_1    1    5000    +
chr1    5001    5100    2    U    100    scaffold    yes    proximity_ligation
```
  • Tab-delimited genome assembly format
  • 1-based closed coordinates [start, end]
  • Describes construction of objects from components
  • Object and component lengths must match
  • See AGP Format section for complete specification
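The rule that object and component lengths must match can be checked mechanically for `W` lines (a minimal sketch; both coordinate pairs are 1-based and closed, so a span is end - start + 1):

```python
def agp_spans_match(line):
    """For a W component line, the object span must equal the component span."""
    cols = line.rstrip("\n").split("\t")
    if cols[4] != "W":  # column 5 is the component type; skip gap lines (N/U)
        return True
    object_span = int(cols[2]) - int(cols[1]) + 1
    component_span = int(cols[7]) - int(cols[6]) + 1
    return object_span == component_span

line = "chr1\t1\t5000\t1\tW\tcontig_1\t1\t5000\t+"
print(agp_spans_match(line))  # True
```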

Best Practices

General

  1. Always check data type: Paired-end vs single-end determines filtering strategy
  2. Understand your sequencing technology: Hi-C behaves differently than HiFi
  3. Filter in the right order: Proper pairs BEFORE region filtering
  4. Validate outputs: Check file sizes, read counts, flagstat
  5. Use appropriate MAPQ thresholds: Too stringent = lost data, too permissive = noise

For Hi-C Data

  1. Expect distant read pairs: Don't be surprised by different scaffolds
  2. Preserve proper pairs: Critical for downstream scaffolding
  3. Use paired-aware tools: Standard filters may break pairs
  4. Don't over-filter on MAPQ: Hi-C often has lower MAPQ than DNA-seq

For HiFi Data

  1. Single-end processing: No pair concerns
  2. High quality expected: Can use strict filters
  3. Use appropriate presets: minimap2 `map-hifi` or `map-pb`
  4. Consider read length distribution: HiFi reads vary in length

For Tool Testing

  1. Create self-contained datasets: Both mates in selected region
  2. Maintain proper pairs: Essential for realistic testing
  3. Use representative data: Subsample proportionally, not randomly
  4. Verify file sizes: Too small = overly filtered

Genome Ark Data Retrieval

Overview

Genome Ark (s3://genomeark/) is a public AWS S3 bucket containing VGP assemblies and QC data. No credentials are required; access it with `--no-sign-request`.

Directory Structure

```
s3://genomeark/species/{SPECIES_NAME}/{TOLID}/
├── assembly_vgp_HiC_2.0/          # Standard HiC assembly
├── assembly_vgp_standard_2.1/     # Standard assembly
├── assembly_vgp_trio_1.0/         # Trio assembly
├── assembly_curated/              # Manually curated
├── assembly_{METHOD}_{VERSION}/   # Method-specific (e.g., assembly_verkko_1.4)
└── genomic_data/                  # Raw sequencing data
```

GenomeScope Data Locations

GenomeScope summaries can be in multiple locations - search comprehensively:
Common paths (priority order):
  1. {assembly}/evaluation/genomescope/{TOLID}_genomescope__Summary.txt
  2. {assembly}/evaluation/genomescope2/{TOLID}_genomescope__Summary.txt
  3. {assembly}/qc/genomescope/{TOLID}_genomescope__Summary.txt
  4. {assembly}/intermediates/genomescope/{TOLID}_genomescope__Summary.txt
Search strategy:
  1. Dynamically discover assembly folders (not all species use same structure)
  2. Search multiple subfolder locations per assembly
  3. Try standard assemblies first (HiC_2.0, standard_2.1)
  4. Fall back to method-specific assemblies

Python Implementation Pattern

```python
import subprocess

def search_genomescope_recursive(species_name, tolid):
    """Search for GenomeScope data across all assembly folders."""
    species_s3 = species_name.replace(' ', '_')

    # 1. Discover assembly folders
    result = subprocess.run(
        ['aws', 's3', 'ls', f's3://genomeark/species/{species_s3}/{tolid}/',
         '--no-sign-request'],
        capture_output=True, text=True
    )

    assembly_folders = []
    for line in result.stdout.strip().split('\n'):
        if 'PRE' in line:
            folder = line.split('PRE')[1].strip().rstrip('/')
            if folder.startswith('assembly'):
                assembly_folders.append(folder)

    # 2. Search patterns
    patterns = [
        'evaluation/genomescope/{tolid}_genomescope__Summary.txt',
        'evaluation/genomescope2/{tolid}_genomescope__Summary.txt',
        'qc/genomescope/{tolid}_genomescope__Summary.txt',
        'intermediates/genomescope/{tolid}_genomescope__Summary.txt',
    ]

    # 3. Try each combination
    for assembly in assembly_folders:
        for pattern in patterns:
            s3_path = (f's3://genomeark/species/{species_s3}/{tolid}/'
                       f'{assembly}/{pattern.format(tolid=tolid)}')
            try:
                result = subprocess.run(
                    ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request'],
                    capture_output=True, text=True, timeout=10
                )
            except subprocess.TimeoutExpired:
                continue  # skip slow or unreachable paths
            if result.returncode == 0 and result.stdout:
                return result.stdout

    return None
```

GenomeScope Summary Parsing

```python
import re

def parse_genomescope_summary(content):
    """Extract genome characteristics from GenomeScope2 summary."""
    data = {}

    # Genome Haploid Length (max value - second column)
    match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['genome_size_genomescope'] = int(match.group(1).replace(',', ''))

    # Heterozygosity percentage (max value)
    match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content)
    if match:
        data['heterozygosity_percent'] = float(match.group(1))

    # Genome Unique Length (max value)
    match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['unique_length'] = int(match.group(1).replace(',', ''))

    # Genome Repeat Length (max value)
    match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content)
    if match:
        data['repeat_length'] = int(match.group(1).replace(',', ''))

    # Calculate repeat content percentage
    if 'repeat_length' in data and 'unique_length' in data:
        total = data['repeat_length'] + data['unique_length']
        if total > 0:
            data['repeat_content_percent'] = (data['repeat_length'] / total) * 100

    return data
```

Best Practices

  1. Rate limiting: Add 0.2s delay between successful fetches to be respectful to AWS
  2. Caching: Check if file exists locally before fetching
  3. Timeout: Use 10-15s timeout per request
  4. Assembly discovery: Always discover assemblies dynamically - don't assume structure
  5. Multiple locations: GenomeScope data may be in evaluation/, qc/, or intermediates/
  6. Expected coverage: ~15-20% of VGP assemblies have GenomeScope data available

Common Issues

**Species not found**: Some species use different naming (check exact spacing/underscores)
```bash
# Search for species
aws s3 ls s3://genomeark/species/ --no-sign-request | grep -i "species_name"
```

**Assembly folder variations**: Not all use standard names
```bash
# List all assemblies for a species
aws s3 ls s3://genomeark/species/{SPECIES}/{TOLID}/ --no-sign-request
```

AWS CLI Setup

```bash
# Install AWS CLI (no credentials needed for Genome Ark)
conda install -c conda-forge awscli

# Test access
aws s3 ls s3://genomeark/species/ --no-sign-request | head
```

Karyotype Data Curation and Literature Search

Overview

Karyotype data (diploid 2n and haploid n chromosome numbers) is critical for genome assembly validation but rarely available via APIs. Manual literature curation is required.

Search Strategy

Effective Search Terms

"{species_name} karyotype chromosome 2n"
"{species_name} diploid number karyotype"
"{genus} karyotype evolution"
"cytogenetic analysis {family_name}"
"{species_name} chromosome number diploid"

Best Reference Sources

  1. PubMed/PMC: Primary cytogenetic studies
  2. ResearchGate: Karyotype descriptions and figures
  3. Specialized databases:
  4. Genome assembly papers: Often mention expected karyotype
  5. Comparative cytogenetic studies: Family-level analyses

Search Time Estimates

  • Model organisms, domestic species: 2-3 minutes
  • Well-studied taxonomic groups: 5-10 minutes
  • Rare/uncommon species: 10-20 minutes or not found

Taxonomic Conservation Patterns

Mammals

  • Cetaceans: Highly conserved 2n = 44, n = 22 (exceptions: pygmy sperm whale, right whale, beaked whales = 2n = 42)
  • Felidae: Conserved 2n = 38, n = 19
  • Canidae: Conserved 2n = 78, n = 39
  • Primates: Variable (great apes 2n = 48, macaques 2n = 42, marmosets 2n = 46)

Birds

  • Anatidae (waterfowl): Highly conserved 2n = 80, n = 40 across ducks, geese, swans
  • Galliformes (game birds): Typically 2n = 78, n = 39 (chicken, quail, grouse)
  • Passerines: Variable 2n = 78-82, most common 2n = 80
  • Ancestral avian karyotype: Putative 2n = 80
  • General pattern: 50.7% of birds have 2n = 78-82; 21.7% have exactly 2n = 80

Reptiles

  • Lacertidae (wall lizards): Often 2n = 38, n = 19

Genome Assembly Interpretation

⚠️ Warning: Chromosome-level assemblies often report fewer chromosomes than actual diploid number
Why: Assemblies typically capture only:
  • Macrochromosomes (large chromosomes)
  • Larger microchromosomes
  • Small microchromosomes remain unassembled
Example: Waterfowl with 2n = 80 often have genome assemblies with 34-42 "chromosomes"
  • True karyotype: 10 macro pairs + 30 micro pairs = 80
  • Assembly: ~34-42 scaffolds (only macro + larger micros)

Using Conservation for Inference

When specific karyotype data is unavailable but genus/family patterns are strong:
  1. High confidence inference (acceptable for publication):
    • Multiple congeneric species confirmed
    • Family-level conservation documented
    • No known exceptions in genus
  2. Document inference clearly:
    ```csv
    accession,taxid,species,2n,n,notes,reference
    GCA_XXX,123,Species name,80,40,Inferred from Anatidae conservation,https://family-level-study.url
    ```
  3. Priority for direct confirmation:
    • Species with conservation exceptions
    • Type specimens or reference species
    • Phylogenetically divergent lineages

VGP-Specific: Sex Chromosome Adjustment

When both sex chromosomes are in main haplotype (common in VGP assemblies):
  • Expected scaffolds = n + 1 (not n)
  • Reason: X+Y or Z+W = two distinct chromosomes
  • Check: VGP metadata column "Sex chromosomes main haplotype"
  • Patterns: "Has X and Y", "Has Z and W", "Has X1, X2, and Y"
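This adjustment can be sketched as a small helper (an illustration only: `expected_scaffolds` is a hypothetical function, and it assumes the metadata strings look exactly like the patterns listed above):

```python
def expected_scaffolds(haploid_n: int, sex_chr_field: str) -> int:
    """Expected chromosome-level scaffolds in the main haplotype.

    sex_chr_field is the VGP metadata value, e.g. "Has X and Y".
    n already counts one sex chromosome; each additional distinct
    sex chromosome adds one scaffold.
    """
    # Count distinct sex chromosome names, e.g. "X1, X2, and Y" -> 3
    names = sex_chr_field.replace("Has", "").replace("and", ",").split(",")
    distinct = [s.strip() for s in names if s.strip()]
    return haploid_n + max(len(distinct) - 1, 0)
```

For a waterfowl main haplotype carrying both Z and W, this gives n + 1 = 41 expected scaffolds, matching the rule above.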

Data Recording Format

CSV Structure:
```csv
accession,taxid,species_name,diploid_2n,haploid_n,notes,reference
GCA_XXXXXX,12345,Species name,80,40,Brief description,https://doi.org/...
```
Notes field examples:
  • "Standard {family} karyotype"
  • "Conserved {genus} karyotype"
  • "Inferred from {family} conservation"
  • "Unusual karyotype for family"
  • "Geographic variation reported"
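Appending a record in this structure might look like the following sketch (the filename `karyotypes.csv` and the record values are hypothetical):

```python
import csv

# Hypothetical record following the CSV structure above
record = {
    "accession": "GCA_XXXXXX", "taxid": "12345",
    "species_name": "Species name", "diploid_2n": "80", "haploid_n": "40",
    "notes": "Inferred from Anatidae conservation",
    "reference": "https://doi.org/...",
}

fieldnames = ["accession", "taxid", "species_name", "diploid_2n",
              "haploid_n", "notes", "reference"]

with open("karyotypes.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if f.tell() == 0:   # write the header only for a fresh file
        writer.writeheader()
    writer.writerow(record)
```

Using `csv.DictWriter` keeps the column order fixed and quotes any notes field that happens to contain a comma.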

Prioritization for Literature Searches

TIER 1 (>90% success rate):
  • Model organisms (zebrafish, mouse, medaka)
  • Domestic species (chicken, goat, sheep)
  • Game animals (waterfowl, deer)
  • Laboratory species (fruit fly, nematode)
TIER 2 (70-90% success rate):
  • Well-studied taxonomic groups (Podarcis lizards, corvids)
  • Conservation focus species (raptors, large mammals)
  • Commercial species (salmonids, oysters)
TIER 3 (50-70% success rate):
  • Common but not economically important
  • Widespread distribution
  • Recent phylogenetic interest
Low priority (<50% success rate):
  • Deep-sea species
  • Rare/endangered without conservation genetics
  • Recently described species
  • Cryptic species complexes

AGP Format (A Golden Path)

Overview

AGP (A Golden Path) format describes how assembled sequences (chromosomes, scaffolds) are constructed from component sequences (contigs, scaffolds) and gaps. Critical for genome assembly curation and submission to NCBI/EBI.

When to Use This Knowledge

  • Processing genome assemblies for submission to databases
  • Curating chromosome-level assemblies
  • Splitting haplotype assemblies
  • Assigning unlocalized scaffolds (unlocs)
  • Debugging AGP validation errors
  • Converting between assembly representations

AGP Format Structure

Tab-delimited format with 9 columns; column 5 distinguishes sequence lines (W) from gap lines (N/U).
Sequence Lines (Column 5 = 'W'):
object  obj_beg  obj_end  part_num  W  component_id  comp_beg  comp_end  orientation
Gap Lines (Column 5 = 'N' or 'U'):
object  obj_beg  obj_end  part_num  N/U  gap_length  gap_type  linkage  linkage_evidence
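A minimal parser for these two line types might look like this (a sketch; `parse_agp_line` is a hypothetical helper, not part of any AGP tooling):

```python
def parse_agp_line(line: str) -> dict:
    """Parse one AGP line into named fields.

    Column 5 decides the record type: 'W' for sequence lines,
    'N'/'U' for gap lines.
    """
    f = line.rstrip("\n").split("\t")
    rec = {"object": f[0], "obj_beg": int(f[1]),
           "obj_end": int(f[2]), "part_num": int(f[3])}
    if f[4] == "W":
        rec.update(component_id=f[5], comp_beg=int(f[6]),
                   comp_end=int(f[7]), orientation=f[8])
    else:  # 'N' or 'U' gap line
        rec.update(gap_length=int(f[5]), gap_type=f[6],
                   linkage=f[7], linkage_evidence=f[8])
    return rec
```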

Critical Coordinate Rules

Rule 1: Object and Component Lengths MUST Match
For sequence lines, the span in the object MUST equal the span in the component:
(obj_end - obj_beg + 1) == (comp_end - comp_beg + 1)
Example - CORRECT:
Scaffold_47_unloc_1  1  54360  1  W  scaffold_23.hap1  19274039  19328398  -

Object length: 54360 - 1 + 1 = 54,360 bp
Component length: 19328398 - 19274039 + 1 = 54,360 bp ✓

Example - INCORRECT:
Scaffold_47_unloc_1  1  19328398  1  W  scaffold_23.hap1  19274039  19328398  -

Object length: 19328398 - 1 + 1 = 19,328,398 bp
Component length: 19328398 - 19274039 + 1 = 54,360 bp ✗
ERROR: Lengths don't match!

Rule 2: Component Numbering Restarts for New Objects
Each object (column 1) has its own component numbering (column 4) starting at 1:
Scaffold_10          1  30578279  1  W  scaffold_4.hap2   1  30578279  -
Scaffold_10_unloc_1  1  65764     1  W  scaffold_74.hap2  1  65764     +  # ← Starts at 1, not 3!

Rule 3: Sequential Component Numbering Within Objects
Component numbers increment sequentially (gaps and sequences both count):
Scaffold_2  1        1731008  1  W  scaffold_25.hap1  1  1731008  -
Scaffold_2  1731009  1731108  2  U  100  scaffold  yes  proximity_ligation
Scaffold_2  1731109  1956041  3  W  scaffold_70.hap1  1  224933   -

Common AGP Processing Issues

Issue 1: Incorrect Object Coordinates When Creating Unlocs

Symptom:
ERROR: object coordinates (1, 19328398) and component coordinates (19274039, 19328398)
do not have the same length
Cause: When converting a region of a scaffold into an unlocalized sequence (unloc), the object coordinates must represent the length of the extracted region, not the original component end coordinate.
Wrong Approach:
```python
# Setting object end to component end coordinate
agp_df.loc[index, 'chr_end'] = agp_df.loc[index, 'scaff_end']  # ✗ WRONG
```
Correct Approach:
```python
# Calculate actual length from component coordinates
agp_df.loc[index, 'chr_end'] = int(agp_df.loc[index, 'scaff_end']) - int(agp_df.loc[index, 'scaff_start']) + 1  # ✓ CORRECT
```

Issue 2: Component Numbering Not Reset for New Objects

Symptom: Unloc scaffolds have component numbers > 1 when they should start at 1.
Cause: When creating a new object (unloc scaffold), component numbering wasn't reset.
Solution:
```python
# When creating unlocs, reset component number
agp_df.loc[index, '#_scaffs'] = 1  # Column 4: component number
```

Issue 3: AGPcorrect Accumulating Coordinates

Symptom: Unloc sequences inherit cumulative coordinates from parent scaffolds.
Cause: AGPcorrect adjusts coordinates based on sequence length corrections. When scaffolds are later split into unlocs, the accumulated corrections need to be recalculated based on actual component spans.
Solution: Always recalculate object coordinates from component spans when creating new objects (unlocs).

AGP Processing Best Practices

1. Coordinate System Understanding

  • Object coordinates (columns 2-3): Position within the assembled object (1-based, inclusive)
  • Component coordinates (columns 7-8): Position within the source sequence (1-based, inclusive)
  • Both use 1-based closed intervals [start, end]
  • Length calculation: end - start + 1

2. Creating Unlocalized Sequences (Unlocs)

```python
# When extracting a region to create an unloc:

# 1. Calculate the actual length of the region
length = int(comp_end) - int(comp_start) + 1

# 2. Set object coordinates for the new unloc
obj_start = 1     # Always starts at 1
obj_end = length  # Equals the length

# 3. Reset component number
component_num = 1  # New object, new numbering

# 4. Rename the object
new_object_name = f"{parent_scaffold}_unloc_{unloc_number}"
```

3. Validating AGP Files

Use NCBI's AGP validator:
```bash
agp_validate assembly.agp
```
Common validation checks:
  • Object/component length match
  • Sequential component numbering
  • No coordinate overlaps
  • Gap specifications valid
  • Orientation values (+, -, ?, 0, na)

4. Handling Haplotype-Split Assemblies

When splitting diploid assemblies into haplotypes:
  1. Identify haplotype markers in sequence names (H1/hap1, H2/hap2)
  2. Maintain proper pairing information
  3. Process unlocs separately per haplotype
  4. Remove haplotig duplications
  5. Track gaps appropriately (especially proximity ligation gaps)
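Steps 1-3 could be sketched as follows (assuming haplotype markers appear in component names as in the earlier examples, e.g. `scaffold_4.hap2`; both helpers are hypothetical):

```python
import re
from typing import Optional

def haplotype_of(component_id: str) -> Optional[str]:
    """Infer haplotype from names like 'scaffold_4.hap2' or 'H1_scaffold_3'."""
    m = re.search(r'(?:hap|H)(\d)', component_id)
    return f"hap{m.group(1)}" if m else None

def split_by_haplotype(agp_lines):
    """Partition AGP lines by haplotype; gap lines follow their object's haplotype."""
    out = {"hap1": [], "hap2": []}
    current = None
    for line in agp_lines:
        fields = line.split("\t")
        if len(fields) >= 9 and fields[4] == "W":
            current = haplotype_of(fields[5]) or current
        if current:
            out[current].append(line)
    return out
```

Assigning gap lines to the haplotype of the preceding sequence line keeps proximity ligation gaps attached to the scaffold they join.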

AGP Coordinate Debugging Pattern

When encountering coordinate errors:
```python
# For each AGP line, verify:
obj_length = int(obj_end) - int(obj_beg) + 1
comp_length = int(comp_end) - int(comp_beg) + 1
assert obj_length == comp_length, f"Length mismatch: obj={obj_length}, comp={comp_length}"

# For sequential component numbers:
assert comp_num == expected_num, f"Component number gap: got {comp_num}, expected {expected_num}"
```
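These checks can be combined into a single pass over a file (a sketch only; use NCBI's agp_validate for authoritative validation):

```python
def check_agp(lines):
    """Yield (line_number, message) for basic AGP consistency errors."""
    expected = {}  # object -> next expected component number
    for i, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and comment lines
        f = line.rstrip("\n").split("\t")
        obj, obj_beg, obj_end, part_num = f[0], int(f[1]), int(f[2]), int(f[3])
        # Rule 2/3: numbering restarts per object and increments sequentially
        want = expected.get(obj, 1)
        if part_num != want:
            yield i, f"{obj}: component number {part_num}, expected {want}"
        expected[obj] = part_num + 1
        # Rule 1: object and component spans must match on sequence lines
        if f[4] == "W":
            obj_len = obj_beg and (obj_end - obj_beg + 1)
            comp_len = int(f[7]) - int(f[6]) + 1
            if obj_len != comp_len:
                yield i, f"{obj}: length mismatch obj={obj_len}, comp={comp_len}"
```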

AGP File Structure by Assembly Stage

1. Raw Assembly AGP:
  • Direct representation from assembler
  • May have incorrect sequence lengths
  • Needs coordinate correction (AGPcorrect)
2. Corrected AGP:
  • Sequence lengths match actual FASTA
  • Coordinates adjusted for length discrepancies
  • Ready for haplotype splitting
3. Haplotype-Split AGP:
  • Separate files per haplotype
  • Unlocs identified but not separated
  • Haplotigs marked but not removed
4. Final Curated AGP:
  • Unlocs separated into individual objects
  • Haplotigs removed to separate file
  • Proximity ligation gaps cleaned
  • Ready for database submission

BED File Processing and Telomere Analysis

Pattern: Classifying Scaffolds by Telomere Types

When analyzing telomere data from BED files to classify scaffolds:
File Structure:
  • Terminal telomeres BED: columns include scaffold, start, end, orientation (p/q), accession
  • Interstitial telomeres BED: similar structure with position markers (p/q/u for internal)
Best Practice - Use Python CSV Module:
```python
import csv
from collections import defaultdict

# Use defaultdict for automatic initialization
telomere_counts = defaultdict(lambda: {'terminal': 0, 'interstitial': 0})

# Process with csv.reader (more portable than pandas)
with open('telomeres.bed', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scaffold = row[0]
        accession = row[10]  # GCA accession
        key = (accession, scaffold)
        telomere_counts[key]['terminal'] += 1
```
Why CSV over pandas:
- No external dependencies (pandas may not be installed)
- Faster for simple tabular operations
- Lower memory footprint for large files
- Better portability across environments

Classification Categories:
1. Category 1: 2 terminal telomeres, 0 interstitial (complete chromosomes)
2. Category 2: 1 terminal telomere, 0 interstitial (partial)
3. Category 3: Has interstitial telomeres (likely assembly issues)
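The three categories can then be applied to each entry of the `telomere_counts` mapping built above (`classify` is a small hypothetical helper):

```python
def classify(counts: dict) -> str:
    """Classify a scaffold by its telomere counts (categories as above)."""
    if counts['interstitial'] > 0:
        return 'category_3'   # interstitial telomeres: likely assembly issues
    if counts['terminal'] == 2:
        return 'category_1'   # complete chromosome
    if counts['terminal'] == 1:
        return 'category_2'   # partial
    return 'uncategorized'    # no telomere signal detected
```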

NCBI Data Integration Strategies

Check Existing Data Sources Before API Calls

Problem: Need chromosome counts for 400+ assemblies from NCBI.
Anti-pattern: Query NCBI datasets API for each accession
```python
import subprocess

# DON'T: Query 400+ times
for accession in missing_data:
    result = subprocess.run(['datasets', 'summary', 'genome', 'accession', accession])
    # Takes 10+ minutes, hits API rate limits
```
Better Pattern: Check if data already exists in compiled tables
```python
import csv

# DO: Look for existing compiled data first
# VGP table has multiple chromosome count columns:
# - num_chromosomes (column 54)
# - total_number_of_chromosomes (column 106)
# - num_chromosomes_haploid (column 122)

# Read from existing comprehensive table
with open('VGP-table.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        num_chr = row[53] if row[53] else row[105]  # Fallback strategy
```
Results: Filled 392/417 missing values instantly vs 10+ minutes of API calls.
Fallback Strategy for Multiple Columns:
```python
# Try multiple sources in order of preference
num_chromosomes = row[53] if (len(row) > 53 and row[53]) else ''
if not num_chromosomes and len(row) > 105:
    num_chromosomes = row[105]  # Alternative column
```
When to use NCBI API:
- Data not in existing tables
- Need real-time/latest data
- Fetching assembly reports or sequence data
- Small number of queries (<20)

API Best Practices (when necessary):
- Use full path to datasets command (may be aliased)
- Add delays between calls (`time.sleep(0.5)`)
- Set reasonable timeouts
- Handle errors gracefully
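Put together, those practices might look like this sketch (it assumes the NCBI `datasets` CLI may or may not be installed; `fetch_summary` is a hypothetical wrapper, and the accession is a placeholder):

```python
import shutil
import subprocess
import time
from typing import Optional

def fetch_summary(accession: str, binary: Optional[str] = None) -> Optional[str]:
    """Fetch one genome summary via the datasets CLI; None on any failure."""
    binary = binary or shutil.which("datasets")  # full path; avoids shell aliases
    if binary is None:
        return None                              # CLI not installed
    try:
        result = subprocess.run(
            [binary, "summary", "genome", "accession", accession],
            capture_output=True, text=True, timeout=60,  # reasonable timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None                              # handle errors gracefully
    return result.stdout if result.returncode == 0 else None

if __name__ == "__main__":
    for accession in ["GCA_XXXXXX.1"]:           # hypothetical accession
        print(fetch_summary(accession))
        time.sleep(0.5)                          # delay between calls
```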

Related Skills

  • vgp-pipeline - VGP workflows process Hi-C and HiFi data
  • galaxy-tool-wrapping - Galaxy tools work with SAM/BAM and sequencing data formats
  • galaxy-workflow-development - Workflows process sequencing data

Supporting Documentation

  • reference.md: Detailed format specifications and tool documentation
  • common-issues.md: Comprehensive troubleshooting guide with examples

Version History

  • v1.1.1: Added BED file processing patterns for telomere analysis and NCBI data integration strategies
  • v1.1.0: Added comprehensive AGP format documentation including coordinate validation, unloc processing, and common error patterns
  • v1.0.0: Initial release with SAM/BAM, Hi-C, HiFi, common filtering patterns