geniml
Geniml: Genomic Interval Machine Learning
Overview
Geniml is a Python package for building machine learning models on genomic interval data from BED files. It provides unsupervised methods for learning embeddings of genomic regions, single cells, and metadata labels, enabling similarity searches, clustering, and downstream ML tasks.
Installation
Install geniml using uv:
bash
uv pip install geniml
For ML dependencies (PyTorch, etc.):
bash
uv pip install 'geniml[ml]'
Development version from GitHub:
bash
uv pip install git+https://github.com/databio/geniml.git
Core Capabilities
Geniml provides five primary capabilities, each detailed in dedicated reference files:
1. Region2Vec: Genomic Region Embeddings
Train unsupervised embeddings of genomic regions using word2vec-style learning.
Use for: Dimensionality reduction of BED files, region similarity analysis, feature vectors for downstream ML.
Workflow:
- Tokenize BED files using a universe reference
- Train Region2Vec model on tokens
- Generate embeddings for regions
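The tokenization step can be pictured as mapping each input region onto the universe regions it overlaps. The following is a minimal sketch of that idea in plain Python, not geniml's tokenizer: the `tokenize` function, the `chrom_start_end` token format, and the simple "any overlap" rule are all assumptions made for illustration.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def tokenize(regions: List[Region], universe: List[Region]) -> List[str]:
    """Return the universe regions (as string tokens) overlapped by `regions`.

    Hypothetical helper for illustration; not part of geniml.
    """
    tokens = []
    for chrom, start, end in regions:
        for u_chrom, u_start, u_end in universe:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == u_chrom and start < u_end and u_start < end:
                tokens.append(f"{u_chrom}_{u_start}_{u_end}")
    # Drop repeated tokens while keeping first-seen order.
    return list(dict.fromkeys(tokens))


universe = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
sample = [("chr1", 150, 180), ("chr1", 190, 320), ("chr3", 0, 10)]
tokens = tokenize(sample, universe)
print(tokens)
```

Once every BED file is expressed in the same token vocabulary, the token lists can be treated like sentences for word2vec-style training. The nested loop is quadratic and only suitable for toy inputs; a sorted index or interval tree would be the usual choice at scale.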
Reference: See references/region2vec.md for detailed workflow, parameters, and examples.
2. BEDspace: Joint Region and Metadata Embeddings
Train shared embeddings for region sets and metadata labels using StarSpace.
Use for: Metadata-aware searches, cross-modal queries (region→label or label→region), joint analysis of genomic content and experimental conditions.
Workflow:
- Preprocess regions and metadata
- Train BEDspace model
- Compute distances
- Query across regions and labels
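Because region sets and labels live in one embedding space, a cross-modal query reduces to a nearest-neighbor lookup between the two groups of vectors. The sketch below illustrates this with toy two-dimensional vectors and cosine similarity. The vectors, file names, and the `nearest` helper are hypothetical, and cosine similarity is an assumed choice of measure rather than a statement about what BEDspace computes.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: List[float], candidates: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the names of the k candidates most similar to the query vector."""
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)
    return ranked[:k]


# Toy shared embedding space (made-up values).
label_vectors = {"liver": [1.0, 0.0], "brain": [0.0, 1.0]}
region_set_vectors = {"sample_A.bed": [0.9, 0.1], "sample_B.bed": [0.2, 0.8]}

# Region-to-label query: which label best describes sample_A.bed?
r2l = nearest(region_set_vectors["sample_A.bed"], label_vectors)
# Label-to-region query: which region set is closest to "brain"?
l2r = nearest(label_vectors["brain"], region_set_vectors)
print(r2l, l2r)
```

The same `nearest` function serves both query directions, which is the practical benefit of a joint embedding.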
Reference: See references/bedspace.md for detailed workflow, search types, and examples.
3. scEmbed: Single-Cell Chromatin Accessibility Embeddings
Train Region2Vec models on single-cell ATAC-seq data for cell-level embeddings.
Use for: scATAC-seq clustering, cell-type annotation, dimensionality reduction of single cells, integration with scanpy workflows.
Workflow:
- Prepare AnnData with peak coordinates
- Pre-tokenize cells
- Train scEmbed model
- Generate cell embeddings
- Cluster and visualize with scanpy
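One way to picture the "generate cell embeddings" step is as pooling: each cell is a bag of region tokens, and its embedding is derived from the vectors of those regions. The sketch below uses a plain average, which is an assumption made for illustration and not a description of scEmbed's internals. The `mean_pool` helper and the toy vectors are hypothetical.

```python
from typing import Dict, List, Optional


def mean_pool(tokens: List[str], region_vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of the tokens that exist in the vocabulary.

    Returns None when no token is known, so callers can skip empty cells.
    """
    known = [region_vectors[t] for t in tokens if t in region_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy region embeddings (made-up values).
region_vectors = {
    "chr1_100_200": [1.0, 0.0],
    "chr1_300_400": [0.0, 1.0],
    "chr2_50_150": [1.0, 1.0],
}

# One cell with two known accessible regions and one out-of-vocabulary token.
cell_tokens = ["chr1_100_200", "chr1_300_400", "chrX_1_2"]
cell_embedding = mean_pool(cell_tokens, region_vectors)
print(cell_embedding)
```

Stacking one such vector per cell gives a cells-by-dimensions matrix, which is the shape scanpy expects for an `adata.obsm` entry.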
Reference: See references/scembed.md for detailed workflow, parameters, and examples.
4. Consensus Peaks: Universe Building
Build reference peak sets (universes) from BED file collections using multiple statistical methods.
Use for: Creating tokenization references, standardizing regions across datasets, defining consensus features with statistical rigor.
Workflow:
- Combine BED files
- Generate coverage tracks
- Build universe using CC, CCF, ML, or HMM method
Methods:
- CC (Coverage Cutoff): Simple threshold-based
- CCF (Coverage Cutoff Flexible): Confidence intervals for boundaries
- ML (Maximum Likelihood): Probabilistic modeling of positions
- HMM (Hidden Markov Model): Complex state modeling
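The CC method is the simplest of the four, and its logic can be sketched in a few lines: keep positions whose coverage reaches a cutoff, merge nearby runs, and discard short ones. The code below is an illustrative sketch under stated assumptions, not geniml's implementation. The `coverage_cutoff` function is hypothetical, coverage is assumed to be a per-base list for one chromosome, and the exact comparison operators (`>=`, `<=`) are guesses.

```python
from typing import List, Tuple


def coverage_cutoff(coverage: List[int], cutoff: int, merge: int = 0,
                    filter_size: int = 0) -> List[Tuple[int, int]]:
    """Build half-open (start, end) regions from per-base coverage."""
    # 1. Collect contiguous runs where coverage reaches the cutoff.
    runs: List[List[int]] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= cutoff and start is None:
            start = pos
        elif depth < cutoff and start is not None:
            runs.append([start, pos])
            start = None
    if start is not None:
        runs.append([start, len(coverage)])

    # 2. Merge runs separated by a gap of at most `merge` bases.
    merged: List[List[int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= merge:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Drop regions shorter than `filter_size` bases.
    return [(s, e) for s, e in merged if e - s >= filter_size]


coverage = [0, 1, 5, 6, 7, 2, 2, 5, 5, 0, 0, 0, 0, 6, 0]
regions = coverage_cutoff(coverage, cutoff=5, merge=2, filter_size=3)
print(regions)
```

In this toy track, the runs at positions 2-5 and 7-9 are joined because the gap between them is two bases, while the single-base run at position 13 is filtered out.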
Reference: See references/consensus_peaks.md for method comparison, parameters, and examples.
5. Utilities: Supporting Tools
Additional tools for caching, randomization, evaluation, and search.
Available utilities:
- BBClient: BED file caching for repeated access
- BEDshift: Randomization preserving genomic context
- Evaluation: Metrics for embedding quality (silhouette, Davies-Bouldin, etc.)
- Tokenization: Region tokenization utilities (hard, soft, universe-based)
- Text2BedNN: Neural search backends for genomic queries
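To make the randomization idea concrete, the sketch below shifts each region by a random offset while keeping it on the same chromosome and preserving its width. It is a simplified stand-in, not BEDshift's implementation: the `shift_regions` function, its parameters, and the toy chromosome sizes are hypothetical, and regions are assumed to be no wider than their chromosome.

```python
import random
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def shift_regions(regions: List[Region], chrom_sizes: Dict[str, int],
                  max_shift: int, seed: int = 0) -> List[Region]:
    """Shift each region by a random offset in [-max_shift, max_shift]."""
    rng = random.Random(seed)  # Fixed seed so the null set is reproducible.
    shifted = []
    for chrom, start, end in regions:
        width = end - start
        offset = rng.randint(-max_shift, max_shift)
        # Clamp so the shifted region stays inside the chromosome.
        new_start = min(max(0, start + offset), chrom_sizes[chrom] - width)
        shifted.append((chrom, new_start, new_start + width))
    return shifted


chrom_sizes = {"chr1": 1000, "chr2": 500}  # made-up sizes
peaks = [("chr1", 10, 60), ("chr1", 950, 1000), ("chr2", 200, 300)]
null_peaks = shift_regions(peaks, chrom_sizes, max_shift=100, seed=42)
print(null_peaks)
```

Repeating the call with different seeds yields a family of perturbed region sets that can serve as a null distribution for overlap statistics.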
Reference: See references/utilities.md for detailed usage of each utility.
Common Workflows
Basic Region Embedding Pipeline
python
from geniml.tokenization import hard_tokenization
from geniml.region2vec import region2vec
from geniml.evaluation import evaluate_embeddings

# Step 1: Tokenize BED files
hard_tokenization(
    src_folder='bed_files/',
    dst_folder='tokens/',
    universe_file='universe.bed',
    p_value_threshold=1e-9
)

# Step 2: Train Region2Vec
region2vec(
    token_folder='tokens/',
    save_dir='model/',
    num_shufflings=1000,
    embedding_dim=100
)

# Step 3: Evaluate
metrics = evaluate_embeddings(
    embeddings_file='model/embeddings.npy',
    labels_file='metadata.csv'
)
scATAC-seq Analysis Pipeline
python
import scanpy as sc
from geniml.scembed import ScEmbed
from geniml.io import tokenize_cells

# Step 1: Load data
adata = sc.read_h5ad('scatac_data.h5ad')

# Step 2: Tokenize cells
tokenize_cells(
    adata='scatac_data.h5ad',
    universe_file='universe.bed',
    output='tokens.parquet'
)

# Step 3: Train scEmbed
model = ScEmbed(embedding_dim=100)
model.train(dataset='tokens.parquet', epochs=100)

# Step 4: Generate embeddings
embeddings = model.encode(adata)
adata.obsm['scembed_X'] = embeddings

# Step 5: Cluster with scanpy
sc.pp.neighbors(adata, use_rep='scembed_X')
sc.tl.leiden(adata)
sc.tl.umap(adata)
Universe Building and Evaluation
bash
# Generate coverage
cat bed_files/*.bed > combined.bed
uniwig -m 25 combined.bed chrom.sizes coverage/

# Build universe with coverage cutoff
geniml universe build cc \
  --coverage-folder coverage/ \
  --output-file universe.bed \
  --cutoff 5 \
  --merge 100 \
  --filter-size 50

# Evaluate universe quality
geniml universe evaluate \
  --universe universe.bed \
  --coverage-folder coverage/ \
  --bed-folder bed_files/
CLI Reference
Geniml provides command-line interfaces for major operations:
bash
# Region2Vec training
geniml region2vec --token-folder tokens/ --save-dir model/ --num-shuffle 1000

# BEDspace preprocessing
geniml bedspace preprocess --input regions/ --metadata labels.csv --universe universe.bed

# BEDspace training
geniml bedspace train --input preprocessed.txt --output model/ --dim 100

# BEDspace search
geniml bedspace search -t r2l -d distances.pkl -q query.bed -n 10

# Universe building
geniml universe build cc --coverage-folder coverage/ --output-file universe.bed --cutoff 5

# BEDshift randomization
geniml bedshift --input peaks.bed --genome hg38 --preserve-chrom --iterations 100
When to Use Which Tool
Use Region2Vec when:
- Working with bulk genomic data (ChIP-seq, ATAC-seq, etc.)
- Need unsupervised embeddings without metadata
- Comparing region sets across experiments
- Building features for downstream supervised learning
Use BEDspace when:
- Metadata labels available (cell types, tissues, conditions)
- Need to query regions by metadata or vice versa
- Want joint embedding space for regions and labels
- Building searchable genomic databases
Use scEmbed when:
- Analyzing single-cell ATAC-seq data
- Clustering cells by chromatin accessibility
- Annotating cell types from scATAC-seq
- Integration with scanpy is desired
Use Universe Building when:
- Need reference peak sets for tokenization
- Combining multiple experiments into consensus
- Want statistically rigorous region definitions
- Building standard references for a project
Use Utilities when:
- Need to cache remote BED files (BBClient)
- Generating null models for statistics (BEDshift)
- Evaluating embedding quality (Evaluation)
- Building search interfaces (Text2BedNN)
Best Practices
General Guidelines
- Universe quality is critical: Invest time in building comprehensive, well-constructed universes
- Tokenization validation: Check coverage (>80% ideal) before training
- Parameter tuning: Experiment with embedding dimensions, learning rates, and training epochs
- Evaluation: Always validate embeddings with multiple metrics and visualizations
- Documentation: Record parameters and random seeds for reproducibility
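The tokenization check in the list above can be approximated before any training is run. The sketch below computes the fraction of input regions that overlap at least one universe region and compares it against the 80% guideline. Defining coverage as "fraction of regions with any overlap" is an assumption, and the `universe_coverage` helper and the toy data are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def universe_coverage(regions: List[Region], universe: List[Region]) -> float:
    """Fraction of regions that overlap at least one universe region."""
    if not regions:
        return 0.0

    def is_covered(region: Region) -> bool:
        chrom, start, end = region
        return any(chrom == c and start < e and s < end for c, s, e in universe)

    return sum(is_covered(r) for r in regions) / len(regions)


universe = [("chr1", 100, 200), ("chr2", 50, 150)]
bed_regions = [
    ("chr1", 120, 160),
    ("chr1", 190, 250),
    ("chr2", 0, 60),
    ("chr2", 140, 300),
    ("chr3", 10, 20),   # not represented in the universe
]
coverage = universe_coverage(bed_regions, universe)
print(f"coverage = {coverage:.0%}")
if coverage <= 0.8:
    print("Coverage is at or below the 80% guideline; consider revisiting the universe.")
```

A low value usually points to a universe built from a different assembly or from data unlike the input files.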
Performance Considerations
- Pre-tokenization: For scEmbed, always pre-tokenize cells for faster training
- Memory management: Large datasets may require batch processing or downsampling
- Computational resources: ML/HMM universe methods are computationally intensive
- Model caching: Use BBClient to avoid repeated downloads
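For the memory-management point, batch processing can be as simple as iterating over the input in fixed-size chunks so that only one chunk is held in memory at a time. The generator below is a generic sketch. The `iter_batches` name and the placeholder file names are hypothetical and not part of geniml.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder file names standing in for a large BED collection.
bed_files = [f"sample_{i}.bed" for i in range(7)]
batches = list(iter_batches(bed_files, size=3))
for batch in batches:
    print(len(batch), batch[0])
```

Because the generator accepts any iterable, the same pattern applies to file paths, cells, or lines streamed from disk.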
Integration Patterns
- With scanpy: scEmbed embeddings integrate seamlessly as adata.obsm entries
- With BEDbase: Use BBClient for accessing remote BED repositories
- With Hugging Face: Export trained models for sharing and reproducibility
- With R: Use reticulate for R integration (see utilities reference)
Related Projects
Geniml is part of the BEDbase ecosystem:
- BEDbase: Unified platform for genomic regions
- BEDboss: Processing pipeline for BED files
- Gtars: Genomic tools and utilities
- BBClient: Client for BEDbase repositories
Additional Resources
- Documentation: https://docs.bedbase.org/geniml/
- GitHub: https://github.com/databio/geniml
- Pre-trained models: Available on Hugging Face (databio organization)
- Publications: Cited in documentation for methodological details
Troubleshooting
"Tokenization coverage too low":
- Check universe quality and completeness
- Adjust p-value threshold (try 1e-6 instead of 1e-9)
- Ensure universe matches genome assembly
"Training not converging":
- Adjust learning rate (try 0.01-0.05 range)
- Increase training epochs
- Check data quality and preprocessing
"Out of memory errors":
- Reduce batch size for scEmbed
- Process data in chunks
- Use pre-tokenization for single-cell data
"StarSpace not found" (BEDspace):
- Install StarSpace separately: https://github.com/facebookresearch/StarSpace
- Set the --path-to-starspace parameter correctly
For detailed troubleshooting and method-specific issues, consult the appropriate reference file.