genomeark-aws

GenomeArk AWS S3 Data Repository

Comprehensive guide for accessing and navigating the GenomeArk AWS S3 public bucket containing Vertebrate Genomes Project (VGP) assemblies and quality control data.
Supporting files (read as needed for detailed code and strategies):
  • assembly-date-extraction.md - Extract assembly dates from FASTA filenames, validation rules
  • qc-data-fetching.md - GenomeScope, BUSCO, Merqury, Meryl fetching code and parsing
  • best-practices.md - AWS CLI patterns, batch processing, common pitfalls, testing examples, version history

When to Use This Skill

Use this skill when:
  • Accessing VGP genome assemblies from GenomeArk AWS S3
  • Fetching QC metrics (GenomeScope, BUSCO, Merqury) for genomic analyses
  • Downloading genome evaluation data for comparative studies
  • Accessing meryl k-mer histograms for GenomeScope analysis
  • Building automated pipelines that fetch VGP data
  • Troubleshooting S3 path issues or missing data
  • Working with species-specific genome data from VGP

Repository Overview

GenomeArk is a public AWS S3 bucket (s3://genomeark/) hosting:
  • VGP genome assemblies (primary, alternate, trio)
  • Quality control metrics (GenomeScope, BUSCO, Merqury)
  • Intermediate files (meryl databases, k-mer histograms)
  • Assembly evaluation reports
  • Haplotype-resolved assemblies
Access Method: Public bucket; no AWS credentials are required when using --no-sign-request.
Critical Discovery: GenomeArk structure has evolved over time (2022 -> 2024+). Always implement fallback path patterns for reliability.

Directory Structure

Base Structure

s3://genomeark/
└── species/
    └── {Genus_species}/          # e.g., Rhinolophus_ferrumequinum
        └── {ToLID}/               # e.g., mRhiFer1 (VGP specimen ID)
            ├── assembly_vgp_{type}_{version}/
            │   ├── evaluation/     # QC metrics (MAIN ACCESS POINT)
            │   │   ├── genomescope/
            │   │   ├── busco/
            │   │   ├── merqury/
            │   │   └── ...
            │   └── intermediates/  # K-mer databases, temp files
            │       └── meryl/
            └── genomic_data/      # Raw sequencing data folders
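The layout above maps directly onto S3 prefixes. A minimal sketch of building the evaluation prefix for one assembly follows; the function name `evaluation_prefix` and its arguments are hypothetical names introduced here for illustration, not part of the skill's supporting files.

```python
BUCKET_ROOT = "s3://genomeark/species"


def evaluation_prefix(species, tolid, assembly_dir, tool=None):
    """Build the S3 prefix of an assembly's evaluation/ folder.

    species      -- Genus_species directory, e.g. "Rhinolophus_ferrumequinum"
    tolid        -- specimen ID, e.g. "mRhiFer1"
    assembly_dir -- e.g. "assembly_vgp_HiC_2.0" (case-sensitive on S3)
    tool         -- optional QC subfolder such as "genomescope" or "busco"
    """
    parts = [BUCKET_ROOT, species, tolid, assembly_dir.strip("/"), "evaluation"]
    if tool:
        parts.append(tool)
    # Trailing slash marks the result as a prefix rather than an object key
    return "/".join(parts) + "/"


print(evaluation_prefix("Rhinolophus_ferrumequinum", "mRhiFer1",
                        "assembly_vgp_HiC_2.0", "genomescope"))
```

Keeping the assembly directory as an explicit argument makes it easy to loop over several directory variants for the same specimen.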

Assembly Directory Variations

assembly_vgp_{type}_{version} - Standard VGP Patterns:
  • assembly_vgp_HiC_2.0 - Hi-C phased assembly (case-sensitive!)
  • assembly_vgp_standard_2.0 - Standard assembly without Hi-C
  • assembly_vgp_hic_2.0 - Alternative Hi-C naming
  • assembly_vgp_trio_2.0 - Trio-binned assembly
Legacy Versions (2019-2021 assemblies):
  • assembly_vgp_standard_1.6 - Version 1.6 (common in fish, birds)
  • assembly_vgp_standard_1.0 - Version 1.0 (early assemblies)
  • assembly_vgp_HiC_1.6 - Hi-C version 1.6
  • assembly_vgp_HiC_1.0 - Hi-C version 1.0
  • assembly_vgp_HiC_1.4 - Hi-C version 1.4
Verkko Assemblies (diploid assemblies):
  • assembly_verkko_1.4/ - Verkko version 1.4
  • assembly_verkko_1.1-0.1/ - Verkko version 1.1-0.1
  • assembly_verkko_1.1-0.1-freeze/ - Frozen version
  • assembly_verkko_1.1-0.2/ - Version 1.1-0.2
  • assembly_verkko_1.4.1r/ - Revised version 1.4.1
Clade-Specific Directories (2023+ specialized assemblies):
  • assembly_primate_v1.4.2/ - Primate-specific pipeline
  • assembly_fish_* - Fish-specific (potential)
  • assembly_bird_* - Bird-specific (potential)
Institution-Specific Directories:
  • assembly_rockefeller/ - Rockefeller University assemblies
  • assembly_cambridge/ - Cambridge assemblies
  • assembly_MT_rockefeller/ - Case variation
  • assembly_mt_rockefeller/ - Lowercase variation
  • assembly_mt_milan/ - Milan institute
Directories Without "assembly_" Prefix (rare):
  • vgp_standard_1.6/ - Standard v1.6 without prefix
  • vgp_standard_1.0/ - Standard v1.0 without prefix
  • vgp_HiC_1.6/ - Hi-C v1.6 without prefix
Curated Assemblies (post-manual curation):
  • assembly_curated/ - Exclude for date extraction (post-curation dates)
CRITICAL CASE SENSITIVITY:
  • Metadata may store: assembly_vgp_hic_2.0 (lowercase)
  • S3 requires: assembly_vgp_HiC_2.0 (mixed case!)
  • Always normalize before fetching
COMPREHENSIVE PATTERN MATCHING:
  • Don't stop at first match: Try ALL valid paths
  • Pri/alt assemblies often use legacy versions (1.6, 1.0)
  • Phased assemblies typically use version 2.0
  • Verkko assemblies are diploid and use different naming
  • Coverage improvement: trying all patterns yields 47-62% coverage vs 27% with basic patterns
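The "try ALL valid paths" advice above can be sketched as a filter over a directory listing. The candidate names come from the lists in this section; the try-order, the constant `ASSEMBLY_DIR_CANDIDATES`, and the function `find_assembly_dirs` are assumptions made for illustration. Verkko, clade-specific, and institution-specific names are left out for brevity and could be appended the same way.

```python
# Directory names taken from the lists above; the try-order is an assumption.
ASSEMBLY_DIR_CANDIDATES = [
    "assembly_vgp_HiC_2.0",
    "assembly_vgp_standard_2.0",
    "assembly_vgp_hic_2.0",
    "assembly_vgp_trio_2.0",
    "assembly_vgp_standard_1.6",
    "assembly_vgp_standard_1.0",
    "assembly_vgp_HiC_1.6",
    "assembly_vgp_HiC_1.0",
    "assembly_vgp_HiC_1.4",
    "vgp_standard_1.6",
    "vgp_standard_1.0",
    "vgp_HiC_1.6",
]


def find_assembly_dirs(listed_dirs, candidates=ASSEMBLY_DIR_CANDIDATES):
    """Return every candidate present in a ToLID listing, not just the first.

    listed_dirs -- directory names found under species/{Genus_species}/{ToLID}/
    Matching is exact and case-sensitive, mirroring S3 key behaviour.
    """
    present = {name.strip("/") for name in listed_dirs}
    return [name for name in candidates if name in present]


print(find_assembly_dirs(["assembly_vgp_standard_1.6/", "genomic_data/"]))
```

Returning all matches rather than the first lets the caller fetch QC data from both a legacy pri/alt directory and a newer phased directory when a specimen has both.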

Data Access Summary

For detailed fetching code and parsing logic, see qc-data-fetching.md.
| Data Type | Location | Key Notes |
|---|---|---|
| GenomeScope | evaluation/genomescope/ | 3 filename patterns (double/single/no underscore); validate heterozygosity ranges |
| BUSCO | evaluation/busco/{subdir}/ | Dynamic subdir search (c/, p/, c1/, p1/); parse C:XX.X% |
| Merqury | evaluation/merqury/ | Two path layouts (direct vs nested); QV in column 4 |
| Meryl hist | intermediates/meryl/ | Use .hist file only (~700KB), not full database (~10GB) |
| Assembly dates | FASTA filenames | YYYYMMDD stamps; see assembly-date-extraction.md |
| Technology | genomic_data/ subfolders | pacbio_hifi/ -> HiFi, ont/ -> ONT, etc. |
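The BUSCO and Merqury parsing notes above can be illustrated with a short sketch. This is not the code from qc-data-fetching.md; the function names are hypothetical, and the Merqury parser assumes whitespace-separated rows with a label in column 1 and the QV in column 4.

```python
import re

# Matches the completeness field of a BUSCO summary line, e.g. "C:95.2%[S:...".
BUSCO_COMPLETE_RE = re.compile(r"C:(\d+(?:\.\d+)?)%")


def parse_busco_completeness(summary_text):
    """Return BUSCO completeness as a float, or None if no C:XX.X% field is found."""
    match = BUSCO_COMPLETE_RE.search(summary_text)
    return float(match.group(1)) if match else None


def parse_merqury_qv(qv_text):
    """Return {row label: QV} from a Merqury QV table.

    Assumes whitespace-separated rows with the label in column 1 and the
    QV in column 4; rows that are too short or non-numeric are skipped.
    """
    qvs = {}
    for line in qv_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            qvs[fields[0]] = float(fields[3])
        except ValueError:
            continue
    return qvs
```

Both parsers return an empty result instead of raising, so a pipeline can record the value as missing and move on to the next path.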

Path Normalization (used by all fetching functions)

python
def normalize_s3_path(s3_path):
    """Normalize path for GenomeArk (case sensitivity!)"""
    if not s3_path:
        return None
    s3_path = s3_path.replace('/assembly_vgp_hic_2.0/', '/assembly_vgp_HiC_2.0/')
    if not s3_path.endswith('/'):
        s3_path += '/'
    return s3_path

GenomeScope Filename Patterns (TRY ALL THREE!)

  • Pattern A: {ToLID}_genomescope__Summary.txt (double underscore, most common)
  • Pattern C: {ToLID}_genomescope_Summary.txt (single underscore, easily missed)
  • Pattern B: {ToLID}_Summary.txt (no prefix, older assemblies)
Checking only A and B causes ~30-40% of data to be missed.
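A minimal sketch of trying all three patterns against a directory listing follows. The function names are hypothetical, and the order mirrors the list above.

```python
def genomescope_summary_names(tolid):
    """Return the three known summary filenames, in the order listed above."""
    return [
        f"{tolid}_genomescope__Summary.txt",  # Pattern A: double underscore
        f"{tolid}_genomescope_Summary.txt",   # Pattern C: single underscore
        f"{tolid}_Summary.txt",               # Pattern B: no prefix
    ]


def pick_genomescope_summary(tolid, available_files):
    """Return the first known summary filename present in a listing, else None."""
    available = set(available_files)
    for name in genomescope_summary_names(tolid):
        if name in available:
            return name
    return None


print(pick_genomescope_summary("mRhiFer1", ["mRhiFer1_genomescope_Summary.txt"]))
```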

GenomeScope Validation

Reject failed runs where heterozygosity range > 50% or max > 95%. A range of 0%-100% indicates complete model failure.
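The rejection rule can be written as a small predicate. This sketch assumes the heterozygosity minimum and maximum have already been parsed into percentages, and it interprets "range" as max minus min; the function name is hypothetical.

```python
def is_valid_heterozygosity(het_min_pct, het_max_pct):
    """Apply the rejection rule above to a heterozygosity interval (in percent).

    "Range" is interpreted here as max minus min, which is an assumption.
    """
    if het_min_pct > het_max_pct:
        return False  # malformed interval
    het_range = het_max_pct - het_min_pct
    return het_range <= 50.0 and het_max_pct <= 95.0


print(is_valid_heterozygosity(0.0, 100.0))  # complete model failure case
```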

Meryl Histograms - Direct HTTPS URLs (for Galaxy import)

https://genomeark.s3.amazonaws.com/species/{species}/{tolid}/assembly_vgp_standard_1.0/intermediates/meryl/{tolid}.cut.meryl.hist
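The URL template above can be filled in programmatically. In this sketch the assembly directory is exposed as a parameter defaulting to the one shown in the template; whether other assembly directories follow the same intermediates/meryl/ layout and filename is an assumption that should be checked per specimen. The function name is hypothetical.

```python
HTTPS_ROOT = "https://genomeark.s3.amazonaws.com/species"


def meryl_hist_url(species, tolid, assembly_dir="assembly_vgp_standard_1.0"):
    """Build the public HTTPS URL of a meryl histogram following the template above."""
    return (
        f"{HTTPS_ROOT}/{species}/{tolid}/{assembly_dir}"
        f"/intermediates/meryl/{tolid}.cut.meryl.hist"
    )


print(meryl_hist_url("Rhinolophus_ferrumequinum", "mRhiFer1"))
```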

Quick Reference

AWS CLI pattern (prefer over boto3 for public buckets):
python
import subprocess

# "-" streams the object to stdout; --no-sign-request skips credential lookup
cmd = ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request']
result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
Rate limiting: 0.2s delay between requests.
Common pitfalls: Case sensitivity (hic vs HiC), directory evolution (2022 vs 2024 layouts), downloading full meryl databases instead of .hist files. See best-practices.md for full list.
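The CLI pattern, the 0.2s delay, and the fallback-path advice can be combined into one helper. This is a sketch under stated assumptions, not the code from best-practices.md: `fetch_s3_text` and `fetch_first_available` are hypothetical names, and the `fetch` parameter exists so the fallback logic can be exercised without network access.

```python
import subprocess
import time


def fetch_s3_text(s3_path, timeout=30):
    """Stream one public S3 object to stdout via the AWS CLI; None on failure."""
    cmd = ["aws", "s3", "cp", s3_path, "-", "--no-sign-request"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None  # timed out, or the aws executable is not installed
    return result.stdout if result.returncode == 0 else None


def fetch_first_available(s3_paths, fetch=fetch_s3_text, delay=0.2):
    """Try each candidate path in turn, pausing `delay` seconds between requests.

    Returns (path, text) for the first successful fetch, or (None, None).
    """
    for index, path in enumerate(s3_paths):
        if index:
            time.sleep(delay)  # rate limiting between consecutive requests
        text = fetch(path)
        if text is not None:
            return path, text
    return None, None
```

Passing the candidate paths in priority order keeps the fallback behaviour explicit, and returning the matched path records which layout a given specimen actually uses.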