genomeark-aws
GenomeArk AWS S3 Data Repository
Comprehensive guide for accessing and navigating the GenomeArk AWS S3 public bucket containing Vertebrate Genomes Project (VGP) assemblies and quality control data.
Supporting files (read as needed for detailed code and strategies):
- assembly-date-extraction.md - Extract assembly dates from FASTA filenames, validation rules
- qc-data-fetching.md - GenomeScope, BUSCO, Merqury, Meryl fetching code and parsing
- best-practices.md - AWS CLI patterns, batch processing, common pitfalls, testing examples, version history
When to Use This Skill
Use this skill when:
- Accessing VGP genome assemblies from GenomeArk AWS S3
- Fetching QC metrics (GenomeScope, BUSCO, Merqury) for genomic analyses
- Downloading genome evaluation data for comparative studies
- Accessing meryl k-mer histograms for GenomeScope analysis
- Building automated pipelines that fetch VGP data
- Troubleshooting S3 path issues or missing data
- Working with species-specific genome data from VGP
Repository Overview
GenomeArk is a public AWS S3 bucket (s3://genomeark/) hosting:
- VGP genome assemblies (primary, alternate, trio)
- Quality control metrics (GenomeScope, BUSCO, Merqury)
- Intermediate files (meryl databases, k-mer histograms)
- Assembly evaluation reports
- Haplotype-resolved assemblies
Access Method: Public bucket requiring no AWS credentials when using --no-sign-request.
Critical Discovery: GenomeArk structure has evolved over time (2022 -> 2024+). Always implement fallback path patterns for reliability.
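As a minimal sketch of anonymous access, the helpers below build an `aws s3 ls` command for one specimen directory and pull the sub-directory names out of its output. The function names are hypothetical, and the parser assumes the usual `aws s3 ls` layout in which sub-directories appear on lines beginning with `PRE`.

```python
import subprocess

BUCKET = "s3://genomeark"


def build_ls_command(species, tolid):
    """Build an anonymous `aws s3 ls` command for one specimen directory."""
    prefix = f"{BUCKET}/species/{species}/{tolid}/"
    return ["aws", "s3", "ls", prefix, "--no-sign-request"]


def parse_prefixes(ls_output):
    """Return sub-directory names from `aws s3 ls` output (assumed 'PRE name/' lines)."""
    dirs = []
    for line in ls_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "PRE":
            dirs.append(parts[1].rstrip("/"))
    return dirs


def list_specimen_dirs(species, tolid, timeout=30):
    """List the directories under a specimen; returns [] if the call fails."""
    result = subprocess.run(
        build_ls_command(species, tolid),
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        return []
    return parse_prefixes(result.stdout)
```

Listing what actually exists before building paths is one way to cope with the layout changes noted above, since it avoids guessing a directory name that a given specimen may not have.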
Directory Structure
Base Structure
s3://genomeark/
└── species/
    └── {Genus_species}/                    # e.g., Rhinolophus_ferrumequinum
        └── {ToLID}/                        # e.g., mRhiFer1 (VGP specimen ID)
            ├── assembly_vgp_{type}_{version}/
            │   ├── evaluation/             # QC metrics (MAIN ACCESS POINT)
            │   │   ├── genomescope/
            │   │   ├── busco/
            │   │   ├── merqury/
            │   │   └── ...
            │   └── intermediates/          # K-mer databases, temp files
            │       └── meryl/
            └── genomic_data/               # Raw sequencing data folders
Assembly Directory Variations
assembly_vgp_{type}_{version} - Standard VGP Patterns:
- assembly_vgp_HiC_2.0 - Hi-C phased assembly (case-sensitive!)
- assembly_vgp_standard_2.0 - Standard assembly without Hi-C
- assembly_vgp_hic_2.0 - Alternative Hi-C naming
- assembly_vgp_trio_2.0 - Trio-binned assembly
Legacy Versions (2019-2021 assemblies):
- assembly_vgp_standard_1.6 - Version 1.6 (common in fish, birds)
- assembly_vgp_standard_1.0 - Version 1.0 (early assemblies)
- assembly_vgp_HiC_1.6 - Hi-C version 1.6
- assembly_vgp_HiC_1.0 - Hi-C version 1.0
- assembly_vgp_HiC_1.4 - Hi-C version 1.4
Verkko Assemblies (diploid assemblies):
- assembly_verkko_1.4/ - Verkko version 1.4
- assembly_verkko_1.1-0.1/ - Verkko version 1.1-0.1
- assembly_verkko_1.1-0.1-freeze/ - Frozen version
- assembly_verkko_1.1-0.2/ - Version 1.1-0.2
- assembly_verkko_1.4.1r/ - Revised version 1.4.1
Clade-Specific Directories (2023+ specialized assemblies):
- assembly_primate_v1.4.2/ - Primate-specific pipeline
- assembly_fish_* - Fish-specific (potential)
- assembly_bird_* - Bird-specific (potential)
Institution-Specific Directories:
- assembly_rockefeller/ - Rockefeller University assemblies
- assembly_cambridge/ - Cambridge assemblies
- assembly_MT_rockefeller/ - Case variation
- assembly_mt_rockefeller/ - Lowercase variation
- assembly_mt_milan/ - Milan institute
Directories Without "assembly_" Prefix (rare):
- vgp_standard_1.6/ - Standard v1.6 without prefix
- vgp_standard_1.0/ - Standard v1.0 without prefix
- vgp_HiC_1.6/ - Hi-C v1.6 without prefix
Curated Assemblies (post-manual curation):
- assembly_curated/ - Exclude for date extraction (post-curation dates)
CRITICAL CASE SENSITIVITY:
- Metadata may store: assembly_vgp_hic_2.0 (lowercase)
- S3 requires: assembly_vgp_HiC_2.0 (mixed case!)
- Always normalize before fetching
COMPREHENSIVE PATTERN MATCHING:
- Don't stop at first match: Try ALL valid paths
- Pri/alt assemblies often use legacy versions (1.6, 1.0)
- Phased assemblies typically use version 2.0
- Verkko assemblies are diploid, use different naming
- Coverage improvement: Using all patterns -> 47-62% vs 27% with basic patterns
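One way to apply the "try all valid paths" advice is to filter a specimen's directory listing against every naming family listed in this section, without stopping at the first hit. The sketch below is illustrative only: `select_assembly_dirs` and `KNOWN_PREFIXES` are hypothetical names, and the prefix list is simply transcribed from the patterns above.

```python
# Directory-name prefixes transcribed from the pattern lists in this section.
KNOWN_PREFIXES = (
    "assembly_vgp_",         # standard, HiC/hic, trio; versions 1.0-2.0
    "assembly_verkko_",      # Verkko diploid assemblies
    "assembly_primate_",     # clade-specific
    "assembly_fish_",
    "assembly_bird_",
    "assembly_rockefeller",  # institution-specific
    "assembly_cambridge",
    "assembly_MT_",
    "assembly_mt_",
    "vgp_standard_",         # rare: no "assembly_" prefix
    "vgp_HiC_",
)


def select_assembly_dirs(listed_dirs, include_curated=False):
    """Return every listed directory that matches a known assembly pattern.

    assembly_curated is skipped by default because its dates are
    post-curation and should be excluded from date extraction.
    """
    selected = []
    for name in listed_dirs:
        name = name.rstrip("/")
        if name == "assembly_curated":
            if include_curated:
                selected.append(name)
            continue
        if name.startswith(KNOWN_PREFIXES):
            selected.append(name)
    return selected
```

Returning all matches rather than the first leaves the caller free to probe each assembly directory for QC files, which is where the coverage gain described above comes from.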
Data Access Summary
For detailed fetching code and parsing logic, see qc-data-fetching.md.
| Data Type | Location | Key Notes |
|---|---|---|
| GenomeScope | | 3 filename patterns (double/single/no underscore); validate heterozygosity ranges |
| BUSCO | | Dynamic subdir search (c/, p/, c1/, p1/); parse |
| Merqury | | Two path layouts (direct vs nested); QV in column 4 |
| Meryl hist | | Use |
| Assembly dates | FASTA filenames | YYYYMMDD stamps; see assembly-date-extraction.md |
| Technology | | |
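The Merqury row above notes that QV sits in column 4. A minimal parsing sketch follows, assuming the file is whitespace-delimited with the assembly name in the first column; `parse_merqury_qv` is a hypothetical helper, and the full parsing logic lives in qc-data-fetching.md.

```python
def parse_merqury_qv(text):
    """Return {assembly_name: QV} from the text of a Merqury QV file.

    Assumes whitespace-delimited rows with the assembly name in column 1
    and the QV value in column 4; rows that do not fit are skipped.
    """
    qv_by_assembly = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            qv_by_assembly[fields[0]] = float(fields[3])
        except ValueError:
            # Header or malformed row: column 4 is not numeric
            continue
    return qv_by_assembly
```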
Path Normalization (used by all fetching functions)
```python
def normalize_s3_path(s3_path):
    """Normalize path for GenomeArk (case sensitivity!)"""
    if not s3_path:
        return None
    # Add the trailing slash first so that a path ending in the assembly
    # directory itself is also matched by the replace below.
    if not s3_path.endswith('/'):
        s3_path += '/'
    # Metadata may store lowercase "hic"; the S3 key uses "HiC".
    return s3_path.replace('/assembly_vgp_hic_2.0/', '/assembly_vgp_HiC_2.0/')
```
GenomeScope Filename Patterns (TRY ALL THREE!)
- Pattern A: {ToLID}_genomescope__Summary.txt (double underscore, most common)
- Pattern C: {ToLID}_genomescope_Summary.txt (single underscore, easily missed)
- Pattern B: {ToLID}_Summary.txt (no prefix, older assemblies)
Checking only A and B causes ~30-40% of data to be missed.
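The three patterns can be generated together so that none is forgotten. This is a small sketch with hypothetical function names; the GenomeScope directory is passed in by the caller, not assumed here.

```python
def genomescope_summary_candidates(tolid):
    """Return the three known summary filenames, most common first."""
    return [
        f"{tolid}_genomescope__Summary.txt",  # Pattern A: double underscore
        f"{tolid}_genomescope_Summary.txt",   # Pattern C: single underscore
        f"{tolid}_Summary.txt",               # Pattern B: no prefix
    ]


def genomescope_summary_paths(genomescope_dir, tolid):
    """Join a caller-supplied GenomeScope directory with each candidate name."""
    base = genomescope_dir if genomescope_dir.endswith("/") else genomescope_dir + "/"
    return [base + name for name in genomescope_summary_candidates(tolid)]
```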
GenomeScope Validation
Reject failed runs where heterozygosity range > 50% or max > 95%. A range of 0%-100% indicates complete model failure.
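The rejection rule could be expressed as a small predicate. This sketch assumes the heterozygosity bounds have already been parsed into percentages and that "range" means the maximum minus the minimum in percentage points; the function name and parameters are hypothetical.

```python
def is_valid_heterozygosity(het_min, het_max, max_range=50.0, max_upper=95.0):
    """Return False for failed GenomeScope runs, True otherwise.

    het_min and het_max are percentages (e.g. 0.52 means 0.52%).
    A run is rejected when the range exceeds max_range or the
    upper bound exceeds max_upper.
    """
    if het_max - het_min > max_range:
        return False
    if het_max > max_upper:
        return False
    return True
```

With these thresholds, a 0%-100% result is rejected on both criteria, matching the complete-model-failure case described above.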
Meryl Histograms - Direct HTTPS URLs (for Galaxy import)
https://genomeark.s3.amazonaws.com/species/{species}/{tolid}/assembly_vgp_standard_1.0/intermediates/meryl/{tolid}.cut.meryl.hist
Quick Reference
AWS CLI pattern (prefer over boto3 for public buckets):
```python
import subprocess

# Stream the object to stdout ('-'); --no-sign-request allows anonymous access
cmd = ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request']
result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
```
Rate limiting: 0.2s delay between requests.
Common pitfalls: Case sensitivity (hic vs HiC), directory evolution (2022 vs 2024 layouts), downloading full meryl databases instead of .hist files. See best-practices.md for full list.
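Putting the CLI pattern, the 0.2s delay, and fallback paths together might look like the sketch below. `fetch_first_available` is a hypothetical helper rather than a function from the supporting files; the `runner` parameter exists only so the logic can be exercised without network access.

```python
import subprocess
import time


def fetch_first_available(s3_paths, runner=subprocess.run, delay=0.2, timeout=30):
    """Try each candidate S3 path in order; return (path, text) for the first hit.

    Returns (None, None) when no candidate can be fetched.
    """
    for index, s3_path in enumerate(s3_paths):
        if index > 0:
            time.sleep(delay)  # rate limiting between requests
        cmd = ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request']
        try:
            result = runner(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue  # treat a timeout as a miss and try the next path
        if result.returncode == 0 and result.stdout:
            return s3_path, result.stdout
    return None, None
```

A caller would typically pass normalized candidate paths, for example the three GenomeScope summary filenames under each matching assembly directory.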