# Gtars: Genomic Tools and Algorithms in Rust

## Overview
Gtars is a high-performance Rust toolkit for manipulating, analyzing, and processing genomic interval data. It provides specialized tools for overlap detection, coverage analysis, tokenization for machine learning, and reference sequence management.
Use this skill when working with:
- Genomic interval files (BED format)
- Overlap detection between genomic regions
- Coverage track generation (WIG, BigWig)
- Genomic ML preprocessing and tokenization
- Fragment analysis in single-cell genomics
- Reference sequence retrieval and validation
## Installation

### Python Installation

Install the gtars Python bindings:

```bash
uv pip install gtars
```

### CLI Installation

Install the command-line tools (requires Rust/Cargo):

```bash
# Install with all features
cargo install gtars-cli --features "uniwig overlaprs igd bbcache scoring fragsplit"

# Or install specific features only
cargo install gtars-cli --features "uniwig overlaprs"
```

### Rust Library

Add gtars to `Cargo.toml` for Rust projects:

```toml
[dependencies]
gtars = { version = "0.1", features = ["tokenizers", "overlaprs"] }
```

## Core Capabilities
Gtars is organized into specialized modules, each focused on specific genomic analysis tasks:
### 1. Overlap Detection and IGD Indexing
Efficiently detect overlaps between genomic intervals using the Integrated Genome Database (IGD) data structure.

When to use:
- Finding overlapping regulatory elements
- Variant annotation
- Comparing ChIP-seq peaks
- Identifying shared genomic features

Quick example:

```python
import gtars

# Build IGD index and query overlaps
igd = gtars.igd.build_index("regions.bed")
overlaps = igd.query("chr1", 1000, 2000)
```

See `references/overlap.md` for comprehensive overlap detection documentation.
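The overlap test that IGD accelerates is simple in itself. For half-open BED-style intervals, two regions overlap exactly when each starts before the other ends. A minimal plain-Python sketch of that check (illustrative only, not the gtars API; IGD replaces the linear scan with a binned index):

```python
# BED coordinates are 0-based, half-open: [start, end).
def overlaps(a_start, a_end, b_start, b_end):
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a_start < b_end and b_start < a_end

def query(regions, chrom, start, end):
    """Linear scan over (chrom, start, end) tuples; IGD indexes this instead."""
    return [r for r in regions if r[0] == chrom and overlaps(r[1], r[2], start, end)]

regions = [("chr1", 500, 1500), ("chr1", 3000, 4000), ("chr2", 1000, 2000)]
print(query(regions, "chr1", 1000, 2000))  # [('chr1', 500, 1500)]
```

Note that with half-open coordinates, intervals that merely touch (e.g. `[0, 10)` and `[10, 20)`) do not overlap.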
### 2. Coverage Track Generation
Generate coverage tracks from sequencing data with the uniwig module.

When to use:
- ATAC-seq accessibility profiles
- ChIP-seq coverage visualization
- RNA-seq read coverage
- Differential coverage analysis

Quick example:

```bash
# Generate BigWig coverage track
gtars uniwig generate --input fragments.bed --output coverage.bw --format bigwig
```

See `references/coverage.md` for detailed coverage analysis workflows.
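Conceptually, per-base coverage is computed by incrementing a counter at each interval start, decrementing at each end, and taking a running sum (a difference array). The sketch below shows that idea in plain Python; uniwig implements it natively in Rust, so this is illustrative only:

```python
def coverage(intervals, chrom_len):
    """Per-base coverage for half-open (start, end) intervals on one chromosome."""
    diff = [0] * (chrom_len + 1)
    for start, end in intervals:
        diff[start] += 1   # a read begins covering here
        diff[end] -= 1     # ...and stops covering here (end is exclusive)
    cov, running = [], 0
    for delta in diff[:chrom_len]:
        running += delta
        cov.append(running)
    return cov

print(coverage([(1, 4), (2, 6)], 8))  # [0, 1, 2, 2, 1, 1, 0, 0]
```

Each output position holds the number of intervals covering that base, which is what a WIG/BigWig track stores.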
### 3. Genomic Tokenization
Convert genomic regions into discrete tokens for machine learning applications, particularly for deep learning models on genomic data.

When to use:
- Preprocessing for genomic ML models
- Integration with the geniml library
- Creating position encodings
- Training transformer models on genomic sequences

Quick example:

```python
from gtars.tokenizers import TreeTokenizer

tokenizer = TreeTokenizer.from_bed_file("training_regions.bed")
token = tokenizer.tokenize("chr1", 1000, 2000)
```

See `references/tokenizers.md` for tokenization documentation.
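The idea behind region tokenization can be sketched without gtars: each vocabulary region from a BED file gets an integer ID, and a query region maps to the ID of a vocabulary region it overlaps, with a reserved unknown token when nothing matches. This linear version is a conceptual sketch only; TreeTokenizer uses an interval tree and its exact matching rules may differ:

```python
UNK = 0  # reserved ID for regions outside the vocabulary

def build_vocab(regions):
    """Assign IDs 1..n to (chrom, start, end) vocabulary regions."""
    return {region: i + 1 for i, region in enumerate(regions)}

def tokenize(vocab, chrom, start, end):
    for (vc, vs, ve), token_id in vocab.items():
        if vc == chrom and vs < end and start < ve:  # half-open overlap test
            return token_id
    return UNK

vocab = build_vocab([("chr1", 900, 1500), ("chr1", 5000, 6000)])
print(tokenize(vocab, "chr1", 1000, 2000))  # 1
print(tokenize(vocab, "chr2", 1000, 2000))  # 0 (unknown)
```

The resulting integer IDs are what downstream embedding layers or transformer vocabularies consume.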
### 4. Reference Sequence Management
Handle reference genome sequences and compute digests following the GA4GH refget protocol.

When to use:
- Validating reference genome integrity
- Extracting specific genomic sequences
- Computing sequence digests
- Cross-reference comparisons

Quick example:

```python
import gtars

# Load reference and extract sequences
store = gtars.RefgetStore.from_fasta("hg38.fa")
sequence = store.get_subsequence("chr1", 1000, 2000)
```

See `references/refget.md` for reference sequence operations.
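For context on what these digests are: the GA4GH refget digest ("sha512t24u") is, as described by the refget specification, the SHA-512 hash of the normalized (uppercase) sequence truncated to 24 bytes and base64url-encoded without padding. A sketch of that computation in the standard library (illustrative; gtars computes this internally):

```python
import base64
import hashlib

def sha512t24u(sequence: str) -> str:
    """GA4GH refget digest: base64url of the first 24 bytes of SHA-512."""
    digest = hashlib.sha512(sequence.upper().encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")

# 24 bytes encode to exactly 32 base64 characters, with no '=' padding
print(len(sha512t24u("ACGT")))  # 32
```

Because the input is uppercased first, digests are stable across FASTA files that differ only in soft-masking case.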
### 5. Fragment Processing
Split and analyze fragment files, particularly useful for single-cell genomics data.

When to use:
- Processing single-cell ATAC-seq data
- Splitting fragments by cell barcodes
- Cluster-based fragment analysis
- Fragment quality control

Quick example:

```bash
# Split fragments by clusters
gtars fragsplit cluster-split --input fragments.tsv --clusters clusters.txt --output-dir ./by_cluster/
```

See `references/cli.md` for fragment processing commands.
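The core of cluster splitting is routing each fragment record to a per-cluster bucket via a barcode-to-cluster mapping. A minimal in-memory sketch (fragment TSV columns assumed to be chrom, start, end, barcode, and optionally count; the real `fragsplit` command streams files rather than buffering them):

```python
from collections import defaultdict

def split_by_cluster(fragment_lines, barcode_to_cluster):
    """Group fragment TSV lines by the cluster assigned to their barcode."""
    buckets = defaultdict(list)
    for line in fragment_lines:
        chrom, start, end, barcode = line.rstrip("\n").split("\t")[:4]
        cluster = barcode_to_cluster.get(barcode)
        if cluster is not None:  # fragments with unassigned barcodes are dropped
            buckets[cluster].append(line)
    return buckets

frags = ["chr1\t100\t200\tAAAC\t3", "chr1\t150\t250\tGGTA\t1"]
clusters = {"AAAC": "cluster1", "GGTA": "cluster2"}
print(sorted(split_by_cluster(frags, clusters)))  # ['cluster1', 'cluster2']
```

In a real pipeline each bucket would be written to its own file under the output directory.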
### 6. Fragment Scoring
Score fragment overlaps against reference datasets.

When to use:
- Evaluating fragment enrichment
- Comparing experimental data to references
- Quality metrics computation
- Batch scoring across samples

Quick example:

```bash
# Score fragments against reference
gtars scoring score --fragments fragments.bed --reference reference.bed --output scores.txt
```
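One simple overlap score is the fraction of fragments that hit at least one reference region. The gtars scoring module offers richer metrics; this sketch only illustrates the underlying counting, with half-open (chrom, start, end) tuples assumed:

```python
def overlap_score(fragments, reference):
    """Fraction of fragments overlapping at least one reference region."""
    def hits(frag):
        c, s, e = frag
        return any(c == rc and rs < e and s < re for rc, rs, re in reference)
    scored = sum(1 for frag in fragments if hits(frag))
    return scored / len(fragments) if fragments else 0.0

fragments = [("chr1", 100, 200), ("chr1", 900, 950), ("chr2", 0, 50)]
reference = [("chr1", 150, 500)]
print(overlap_score(fragments, reference))  # 1 of 3 fragments overlap
```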
## Common Workflows
### Workflow 1: Peak Overlap Analysis

Identify overlapping genomic features:

```python
import gtars

# Load two region sets
peaks = gtars.RegionSet.from_bed("chip_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")

# Find overlaps
overlapping_peaks = peaks.filter_overlapping(promoters)

# Export results
overlapping_peaks.to_bed("peaks_in_promoters.bed")
```
### Workflow 2: Coverage Track Pipeline
Generate coverage tracks for visualization:

```bash
# Step 1: Generate coverage
gtars uniwig generate --input atac_fragments.bed --output coverage.wig --resolution 10

# Step 2: Convert to BigWig for genome browsers
gtars uniwig generate --input atac_fragments.bed --output coverage.bw --format bigwig
```
### Workflow 3: ML Preprocessing
Prepare genomic data for machine learning:

```python
from gtars.tokenizers import TreeTokenizer
import gtars

# Step 1: Load training regions
regions = gtars.RegionSet.from_bed("training_peaks.bed")

# Step 2: Create tokenizer
tokenizer = TreeTokenizer.from_bed_file("training_peaks.bed")

# Step 3: Tokenize regions
tokens = [tokenizer.tokenize(r.chromosome, r.start, r.end) for r in regions]

# Step 4: Use tokens in ML pipeline
# (integrate with geniml or custom models)
```
## Python vs CLI Usage
Use Python API when:
- Integrating with analysis pipelines
- Need programmatic control
- Working with NumPy/Pandas
- Building custom workflows
Use CLI when:
- Quick one-off analyses
- Shell scripting
- Batch processing files
- Prototyping workflows
## Reference Documentation
Comprehensive module documentation:

- `references/python-api.md` - Complete Python API reference with RegionSet operations, NumPy integration, and data export
- `references/overlap.md` - IGD indexing, overlap detection, and set operations
- `references/coverage.md` - Coverage track generation with uniwig
- `references/tokenizers.md` - Genomic tokenization for ML applications
- `references/refget.md` - Reference sequence management and digests
- `references/cli.md` - Command-line interface complete reference
## Integration with geniml
Gtars serves as the foundation for the geniml Python package, providing core genomic interval operations for machine learning workflows. When working on geniml-related tasks, use gtars for data preprocessing and tokenization.
## Performance Characteristics
- Native Rust performance: Fast execution with low memory overhead
- Parallel processing: Multi-threaded operations for large datasets
- Memory efficiency: Streaming and memory-mapped file support
- Zero-copy operations: NumPy integration with minimal data copying
## Data Formats
Gtars works with standard genomic formats:
- BED: Genomic intervals (3-column or extended)
- WIG/BigWig: Coverage tracks
- FASTA: Reference sequences
- Fragment TSV: Single-cell fragment files with barcodes
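To make the BED convention concrete: records are tab-separated with at least chrom, start, and end, where start is 0-based and end is exclusive (half-open), so the region length is simply `end - start`. A minimal parsing sketch:

```python
def parse_bed_line(line):
    """Parse one BED line; extra columns (name, score, strand, ...) kept as strings."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    extra = fields[3:]  # present only in extended BED
    return chrom, start, end, extra

chrom, start, end, extra = parse_bed_line("chr1\t999\t2000\tpeak_1\t850\t+")
print(chrom, end - start)  # chr1 1001 -- half-open length is end - start
```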
## Error Handling and Debugging
Enable verbose logging for troubleshooting:

```python
import gtars

# Enable debug logging
gtars.set_log_level("DEBUG")
```

```bash
# CLI verbose mode
gtars --verbose <command>
```