geniml

Geniml: Genomic Interval Machine Learning

Overview

Geniml is a Python package for building machine learning models on genomic interval data from BED files. It provides unsupervised methods for learning embeddings of genomic regions, single cells, and metadata labels, enabling similarity searches, clustering, and downstream ML tasks.

Installation

Install geniml using uv:
bash
uv pip install geniml
For ML dependencies (PyTorch, etc.):
bash
uv pip install 'geniml[ml]'
Development version from GitHub:
bash
uv pip install git+https://github.com/databio/geniml.git

Core Capabilities

Geniml provides five primary capabilities, each detailed in dedicated reference files:

1. Region2Vec: Genomic Region Embeddings

Train unsupervised embeddings of genomic regions using word2vec-style learning.
Use for: Dimensionality reduction of BED files, region similarity analysis, feature vectors for downstream ML.
Workflow:
  1. Tokenize BED files using a universe reference
  2. Train Region2Vec model on tokens
  3. Generate embeddings for regions
Reference: See references/region2vec.md for detailed workflow, parameters, and examples.
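As a rough illustration of step 1, the sketch below maps each region of a query BED file to the universe regions it overlaps. It is a plain-Python approximation of overlap-based tokenization, not geniml's implementation; the helper names `parse_bed` and `tokenize` and the `chrom_start_end` token format are assumptions made for this example.

```python
def parse_bed(text):
    """Parse BED-formatted text into (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def tokenize(regions, universe):
    """Return the universe regions (as string tokens) that the query overlaps."""
    by_chrom = {}
    for chrom, start, end in universe:
        by_chrom.setdefault(chrom, []).append((start, end))

    tokens = []
    for chrom, start, end in regions:
        for u_start, u_end in by_chrom.get(chrom, []):
            # BED intervals are half-open, so touching ends do not overlap.
            if u_start < end and start < u_end:
                tokens.append(f"{chrom}_{u_start}_{u_end}")
    # Drop repeated tokens while keeping first-seen order.
    return list(dict.fromkeys(tokens))


universe = parse_bed("chr1 100 200\nchr1 300 400\nchr2 50 150")
query = parse_bed("chr1 150 320\nchr2 500 600")
tokens = tokenize(query, universe)
print(tokens)
```

The query region on chr2 overlaps nothing in the universe, so it contributes no token. Regions lost this way are what lowers tokenization coverage.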

2. BEDspace: Joint Region and Metadata Embeddings

Train shared embeddings for region sets and metadata labels using StarSpace.
Use for: Metadata-aware searches, cross-modal queries (region→label or label→region), joint analysis of genomic content and experimental conditions.
Workflow:
  1. Preprocess regions and metadata
  2. Train BEDspace model
  3. Compute distances
  4. Query across regions and labels
Reference: See references/bedspace.md for detailed workflow, search types, and examples.
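To make the idea of cross-modal queries concrete, here is a minimal sketch of nearest-neighbor lookup in a shared embedding space. It assumes region sets and labels have already been embedded as vectors of the same dimension and that similarity is measured by cosine; the toy vectors, file names, and the `nearest` helper are invented for illustration and do not come from BEDspace.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, candidates, k=1):
    """Return the k candidate names most similar to the query vector."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional embeddings (hypothetical values).
label_vecs = {"liver": [0.9, 0.1], "brain": [0.1, 0.9]}
region_set_vecs = {"sample_A.bed": [0.8, 0.2], "sample_B.bed": [0.2, 0.7]}

# Region -> label: which label sits closest to this region set?
r2l = nearest(region_set_vecs["sample_A.bed"], label_vecs)
# Label -> region: which region set sits closest to this label?
l2r = nearest(label_vecs["brain"], region_set_vecs)
print(r2l, l2r)
```

Because both kinds of object live in one space, the same `nearest` function serves both query directions.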

3. scEmbed: Single-Cell Chromatin Accessibility Embeddings

Train Region2Vec models on single-cell ATAC-seq data for cell-level embeddings.
Use for: scATAC-seq clustering, cell-type annotation, dimensionality reduction of single cells, integration with scanpy workflows.
Workflow:
  1. Prepare AnnData with peak coordinates
  2. Pre-tokenize cells
  3. Train scEmbed model
  4. Generate cell embeddings
  5. Cluster and visualize with scanpy
Reference: See references/scembed.md for detailed workflow, parameters, and examples.
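One simple way to turn region embeddings into a cell-level embedding is to average the vectors of the regions accessible in that cell. The sketch below shows that pooling step under the assumption that each cell has already been tokenized into region identifiers. This document does not state how scEmbed pools region vectors, so treat mean pooling, the `mean_pool` helper, and the toy data as illustrative assumptions.

```python
def mean_pool(token_ids, region_vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    vectors = [region_vectors[t] for t in token_ids if t in region_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical region embeddings and per-cell token lists.
region_vectors = {"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [1.0, 1.0]}
cells = {"cell_1": ["r1", "r3"], "cell_2": ["r2", "unknown_region"]}

cell_embeddings = {cell: mean_pool(tokens, region_vectors) for cell, tokens in cells.items()}
print(cell_embeddings)
```

Tokens missing from the vocabulary are skipped rather than treated as zeros, so they do not drag the average toward the origin.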

4. Consensus Peaks: Universe Building

Build reference peak sets (universes) from BED file collections using multiple statistical methods.
Use for: Creating tokenization references, standardizing regions across datasets, defining consensus features with statistical rigor.
Workflow:
  1. Combine BED files
  2. Generate coverage tracks
  3. Build universe using CC, CCF, ML, or HMM method
Methods:
  • CC (Coverage Cutoff): Simple threshold-based
  • CCF (Coverage Cutoff Flexible): Confidence intervals for boundaries
  • ML (Maximum Likelihood): Probabilistic modeling of positions
  • HMM (Hidden Markov Model): Complex state modeling
Reference: See references/consensus_peaks.md for method comparison, parameters, and examples.
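The threshold-based idea behind the CC method can be sketched in a few lines: count how many region sets cover each position, keep stretches at or above a cutoff, merge nearby stretches, and drop short ones. The parameter names below mirror the `--cutoff`, `--merge`, and `--filter-size` CLI flags used in this document, but their exact semantics in geniml are assumed here, and the function is an illustration rather than the package's algorithm.

```python
from collections import Counter


def coverage_cutoff_universe(region_sets, cutoff, merge=0, filter_size=0):
    """Return sorted (chrom, start, end) regions covered by >= cutoff region sets.

    Assumes the regions within each individual set do not overlap each other.
    """
    # Sweep-line events per chromosome: +1 where a region starts, -1 where it ends.
    events = {}
    for regions in region_sets:
        for chrom, start, end in regions:
            chrom_events = events.setdefault(chrom, Counter())
            chrom_events[start] += 1
            chrom_events[end] -= 1

    universe = []
    for chrom in sorted(events):
        depth, open_start, kept = 0, None, []
        for pos in sorted(events[chrom]):
            depth += events[chrom][pos]
            if depth >= cutoff and open_start is None:
                open_start = pos  # coverage rises to the cutoff
            elif depth < cutoff and open_start is not None:
                kept.append([open_start, pos])  # coverage falls below it
                open_start = None

        # Merge stretches separated by a gap of at most `merge` bases.
        merged = []
        for start, end in kept:
            if merged and start - merged[-1][1] <= merge:
                merged[-1][1] = end
            else:
                merged.append([start, end])

        # Discard regions shorter than `filter_size`.
        universe += [(chrom, s, e) for s, e in merged if e - s >= filter_size]
    return universe


region_sets = [
    [("chr1", 100, 200), ("chr1", 400, 450)],
    [("chr1", 150, 250), ("chr1", 260, 300)],
    [("chr1", 180, 300)],
]
merged_universe = coverage_cutoff_universe(region_sets, cutoff=2, merge=20, filter_size=50)
strict_universe = coverage_cutoff_universe(region_sets, cutoff=2, merge=0, filter_size=50)
print(merged_universe, strict_universe)
```

With `merge=20` the 10-base gap between the two covered stretches is bridged into one region. With `merge=0` they stay separate and the shorter one is removed by the size filter.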

5. Utilities: Supporting Tools

Additional tools for caching, randomization, evaluation, and search.
Available utilities:
  • BBClient: BED file caching for repeated access
  • BEDshift: Randomization preserving genomic context
  • Evaluation: Metrics for embedding quality (silhouette, Davies-Bouldin, etc.)
  • Tokenization: Region tokenization utilities (hard, soft, universe-based)
  • Text2BedNN: Neural search backends for genomic queries
Reference: See references/utilities.md for detailed usage of each utility.
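For readers unfamiliar with the silhouette metric named above, the following standard-library sketch computes it from first principles for a set of embedding vectors and cluster labels. It is independent of geniml's evaluation module; the function name and toy points are assumptions for this example.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across it
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters. Scores near or below 0 suggest the labels do not match the structure of the embedding.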

Common Workflows

Basic Region Embedding Pipeline

python
from geniml.tokenization import hard_tokenization
from geniml.region2vec import region2vec
from geniml.evaluation import evaluate_embeddings

# Step 1: Tokenize BED files
hard_tokenization(
    src_folder='bed_files/',
    dst_folder='tokens/',
    universe_file='universe.bed',
    p_value_threshold=1e-9
)

# Step 2: Train Region2Vec
region2vec(
    token_folder='tokens/',
    save_dir='model/',
    num_shufflings=1000,
    embedding_dim=100
)

# Step 3: Evaluate
metrics = evaluate_embeddings(
    embeddings_file='model/embeddings.npy',
    labels_file='metadata.csv'
)

scATAC-seq Analysis Pipeline

python
import scanpy as sc
from geniml.scembed import ScEmbed
from geniml.io import tokenize_cells

# Step 1: Load data
adata = sc.read_h5ad('scatac_data.h5ad')

# Step 2: Tokenize cells
tokenize_cells(
    adata='scatac_data.h5ad',
    universe_file='universe.bed',
    output='tokens.parquet'
)

# Step 3: Train scEmbed
model = ScEmbed(embedding_dim=100)
model.train(dataset='tokens.parquet', epochs=100)

# Step 4: Generate embeddings
embeddings = model.encode(adata)
adata.obsm['scembed_X'] = embeddings

# Step 5: Cluster with scanpy
sc.pp.neighbors(adata, use_rep='scembed_X')
sc.tl.leiden(adata)
sc.tl.umap(adata)

Universe Building and Evaluation

bash
# Generate coverage
cat bed_files/*.bed > combined.bed
uniwig -m 25 combined.bed chrom.sizes coverage/

# Build universe with coverage cutoff
geniml universe build cc \
  --coverage-folder coverage/ \
  --output-file universe.bed \
  --cutoff 5 \
  --merge 100 \
  --filter-size 50

# Evaluate universe quality
geniml universe evaluate \
  --universe universe.bed \
  --coverage-folder coverage/ \
  --bed-folder bed_files/

CLI Reference

Geniml provides command-line interfaces for major operations:
bash
# Region2Vec training
geniml region2vec --token-folder tokens/ --save-dir model/ --num-shuffle 1000

# BEDspace preprocessing
geniml bedspace preprocess --input regions/ --metadata labels.csv --universe universe.bed

# BEDspace training
geniml bedspace train --input preprocessed.txt --output model/ --dim 100

# BEDspace search
geniml bedspace search -t r2l -d distances.pkl -q query.bed -n 10

# Universe building
geniml universe build cc --coverage-folder coverage/ --output-file universe.bed --cutoff 5

# BEDshift randomization
geniml bedshift --input peaks.bed --genome hg38 --preserve-chrom --iterations 100

When to Use Which Tool

Use Region2Vec when:
  • Working with bulk genomic data (ChIP-seq, ATAC-seq, etc.)
  • Need unsupervised embeddings without metadata
  • Comparing region sets across experiments
  • Building features for downstream supervised learning
Use BEDspace when:
  • Metadata labels available (cell types, tissues, conditions)
  • Need to query regions by metadata or vice versa
  • Want joint embedding space for regions and labels
  • Building searchable genomic databases
Use scEmbed when:
  • Analyzing single-cell ATAC-seq data
  • Clustering cells by chromatin accessibility
  • Annotating cell types from scATAC-seq
  • Integration with scanpy is desired
Use Universe Building when:
  • Need reference peak sets for tokenization
  • Combining multiple experiments into consensus
  • Want statistically rigorous region definitions
  • Building standard references for a project
Use Utilities when:
  • Need to cache remote BED files (BBClient)
  • Generating null models for statistics (BEDshift)
  • Evaluating embedding quality (Evaluation)
  • Building search interfaces (Text2BedNN)

Best Practices

General Guidelines

  • Universe quality is critical: Invest time in building comprehensive, well-constructed universes
  • Tokenization validation: Check coverage (>80% ideal) before training
  • Parameter tuning: Experiment with embedding dimensions, learning rates, and training epochs
  • Evaluation: Always validate embeddings with multiple metrics and visualizations
  • Documentation: Record parameters and random seeds for reproducibility
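The tokenization check recommended above can be approximated before any training run. The sketch below defines coverage as the fraction of input regions that overlap at least one universe region; that definition, the `tokenization_coverage` helper, and the example data are assumptions for illustration, since this document does not spell out how geniml measures coverage.

```python
def tokenization_coverage(regions, universe):
    """Fraction of input regions overlapping at least one universe region."""
    if not regions:
        return 0.0
    by_chrom = {}
    for chrom, start, end in universe:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = sum(
        1
        for chrom, start, end in regions
        if any(u_start < end and start < u_end for u_start, u_end in by_chrom.get(chrom, []))
    )
    return hits / len(regions)


universe = [("chr1", 100, 200), ("chr1", 300, 400)]
regions = [
    ("chr1", 150, 180),  # inside a universe region
    ("chr1", 190, 310),  # spans two universe regions
    ("chr1", 500, 600),  # no overlap
    ("chr2", 100, 200),  # chromosome absent from the universe
    ("chr1", 390, 395),  # inside a universe region
]
coverage = tokenization_coverage(regions, universe)
needs_attention = coverage < 0.8  # threshold taken from the guideline above
print(coverage, needs_attention)
```

A low value here usually points at the universe (wrong assembly, missing chromosomes, too few regions) rather than at the model.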

Performance Considerations

  • Pre-tokenization: For scEmbed, always pre-tokenize cells for faster training
  • Memory management: Large datasets may require batch processing or downsampling
  • Computational resources: ML/HMM universe methods are computationally intensive
  • Caching: Use BBClient to cache BED files and avoid repeated downloads
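Batch processing, as mentioned under memory management, can be as simple as iterating over cells or files in fixed-size groups so that only one group is held in memory at a time. The generator below is a generic standard-library sketch; the `batched` helper and the cell identifiers are hypothetical and not part of geniml.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


cell_ids = [f"cell_{i}" for i in range(7)]
batch_sizes = [len(batch) for batch in batched(cell_ids, 3)]
print(batch_sizes)
```

Because `batched` accepts any iterable, it also works on a lazily read file or database cursor, which is where the memory savings come from.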

Integration Patterns

  • With scanpy: scEmbed embeddings integrate seamlessly as adata.obsm entries
  • With BEDbase: Use BBClient for accessing remote BED repositories
  • With Hugging Face: Export trained models for sharing and reproducibility
  • With R: Use reticulate for R integration (see utilities reference)

Related Projects

Geniml is part of the BEDbase ecosystem:
  • BEDbase: Unified platform for genomic regions
  • BEDboss: Processing pipeline for BED files
  • Gtars: Genomic tools and utilities
  • BBClient: Client for BEDbase repositories

Additional Resources

Troubleshooting

"Tokenization coverage too low":
  • Check universe quality and completeness
  • Adjust p-value threshold (try 1e-6 instead of 1e-9)
  • Ensure universe matches genome assembly
"Training not converging":
  • Adjust learning rate (try 0.01-0.05 range)
  • Increase training epochs
  • Check data quality and preprocessing
"Out of memory errors":
  • Reduce batch size for scEmbed
  • Process data in chunks
  • Use pre-tokenization for single-cell data
"StarSpace not found" (BEDspace):
For detailed troubleshooting and method-specific issues, consult the appropriate reference file.