Arboreto
Overview
Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.
Core capability: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).
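To make the core idea concrete, the sketch below regresses one synthetic target gene on three candidate regulators and ranks them by tree-ensemble feature importance. It is a toy illustration built on scikit-learn, not Arboreto's own implementation; the gene names, data, and model settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_obs = 200

# Synthetic expression for three candidate regulators (hypothetical names).
tf_names = ["TF_A", "TF_B", "TF_C"]
tf_expr = rng.normal(size=(n_obs, len(tf_names)))

# The target gene is driven by TF_A plus a small amount of noise.
target_expr = 3.0 * tf_expr[:, 0] + rng.normal(scale=0.1, size=n_obs)

# Fit one regression model for this target and read off feature importances.
model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(tf_expr, target_expr)

# Rank candidate regulators from most to least important.
ranked = sorted(
    zip(tf_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_regulator = ranked[0][0]
print(ranked)
```

Repeating this regression for every target gene and collecting the (TF, target, importance) triples gives the kind of link list that a GRN inference run produces.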
Quick Start
Install arboreto:

```bash
uv pip install arboreto
```

Basic GRN inference:

```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (genes as columns)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results (TF, target, importance)
    network.to_csv('network.tsv', sep='\t', index=False, header=False)
```

Critical: Always use the `if __name__ == '__main__':` guard because Dask spawns new processes.
Core Capabilities
1. Basic GRN Inference
For standard GRN inference workflows including:
- Input data preparation (Pandas DataFrame or NumPy array)
- Running inference with GRNBoost2 or GENIE3
- Filtering by transcription factors
- Output format and interpretation
See: `references/basic_inference.md`

Use the ready-to-run script `scripts/basic_grn_inference.py` for standard inference tasks:

```bash
python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
```
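Before running inference, it can help to check the input matrix and the TF list against each other. The following is a minimal sketch using pandas only; `prepare_inputs` and the toy gene names are hypothetical and are not part of Arboreto or of the script above.

```python
import pandas as pd


def prepare_inputs(expression, candidate_tfs):
    """Return (expression, tf_names) suitable for passing on to inference.

    Hypothetical helper: checks that gene columns are unique and keeps
    only the candidate TFs that occur as columns in the matrix.
    """
    duplicated = expression.columns[expression.columns.duplicated()].tolist()
    if duplicated:
        raise ValueError(f"duplicate gene columns: {duplicated}")

    # Keep only TFs that are present as gene columns, preserving list order.
    present = set(expression.columns)
    tf_names = [tf for tf in candidate_tfs if tf in present]
    if not tf_names:
        raise ValueError("no candidate TF matches a gene column")
    return expression, tf_names


# Toy matrix: 3 observations (rows) x 3 genes (columns).
toy = pd.DataFrame(
    {
        "GeneA": [1.0, 2.0, 3.0],
        "GeneB": [0.5, 0.1, 0.9],
        "GeneC": [4.0, 4.5, 3.5],
    },
    index=["obs1", "obs2", "obs3"],
)

matrix, tfs = prepare_inputs(toy, ["GeneA", "GeneX"])
print(tfs)
```

The returned `matrix` and `tfs` would then be passed as `expression_data` and `tf_names`. Dropping unmatched TF names up front makes a mismatch between the TF list and the gene columns visible before a long run starts.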
2. Algorithm Selection
Arboreto provides two algorithms:
GRNBoost2 (Recommended):
- Fast gradient boosting-based inference
- Optimized for large datasets (10k+ observations)
- Default choice for most analyses
GENIE3:
- Random Forest-based inference
- Original multiple regression approach
- Use for comparison or validation
Quick comparison:

```python
from arboreto.algo import grnboost2, genie3

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)
```

**For detailed algorithm comparison, parameters, and selection guidance**: `references/algorithms.md`

3. Distributed Computing
Scale inference from local multi-core to cluster environments:
Local (default) - Uses all available cores automatically:

```python
network = grnboost2(expression_data=matrix)
```

Custom local client - Control resources:

```python
from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

# memory_limit applies per worker
local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

# Release workers once inference is done
client.close()
local_cluster.close()
```

Cluster computing - Connect to remote Dask scheduler:

```python
from arboreto.algo import grnboost2
from distributed import Client

client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)
```

For cluster setup, performance optimization, and large-scale workflows: `references/distributed_computing.md`
Installation
```bash
uv pip install arboreto
```

Dependencies: scipy, scikit-learn, numpy, pandas, dask, distributed
Common Use Cases
Single-Cell RNA-seq Analysis
```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer the regulatory network for the cells in this matrix
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter high-confidence links (0.5 is an illustrative cutoff; tune it per dataset)
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)
```
Bulk RNA-seq with TF Filtering
```python
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )

    network.to_csv('tf_target_network.tsv', sep='\t', index=False)
```
Comparative Analysis (Multiple Conditions)
```python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
```
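Once per-condition networks exist, one simple way to compare them is to look at which top-ranked links are shared, gained, or lost. The sketch below assumes networks with `TF`, `target`, and `importance` columns; `link_set`, the toy data, and the choice of the top 2 links are hypothetical and only illustrate the idea.

```python
import pandas as pd


def link_set(network, top_n):
    """Return the top_n links by importance as a set of (TF, target) pairs."""
    top = network.nlargest(top_n, "importance")
    return set(zip(top["TF"], top["target"]))


# Toy networks standing in for two per-condition output files.
control = pd.DataFrame({
    "TF": ["TF1", "TF2", "TF3"],
    "target": ["G1", "G2", "G3"],
    "importance": [3.0, 2.0, 0.1],
})
treated = pd.DataFrame({
    "TF": ["TF1", "TF4", "TF3"],
    "target": ["G1", "G2", "G3"],
    "importance": [2.5, 2.2, 0.2],
})

control_links = link_set(control, top_n=2)
treated_links = link_set(treated, top_n=2)

shared = control_links & treated_links   # present in both conditions
gained = treated_links - control_links   # only after treatment
lost = control_links - treated_links     # only in control
print(shared, gained, lost)
```

Comparing sets of top links avoids comparing raw importance values across runs, which may not be on the same scale for different datasets.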
Output Interpretation
Arboreto returns a DataFrame with regulatory links:
| Column | Description |
|---|---|
| `TF` | Transcription factor (regulator) |
| `target` | Target gene |
| `importance` | Regulatory importance score (higher = stronger) |
Filtering strategy:
- Top N links per target gene
- Importance threshold (e.g., > 0.5)
- Statistical significance testing (permutation tests)
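The first two strategies can be sketched with plain pandas on a network that has `TF`, `target`, and `importance` columns. The helper names, toy values, and cutoffs below are assumptions for illustration; permutation testing is not covered here.

```python
import pandas as pd

# Toy network with TF, target, and importance columns.
network = pd.DataFrame({
    "TF": ["TF1", "TF2", "TF3", "TF1", "TF2"],
    "target": ["G1", "G1", "G1", "G2", "G2"],
    "importance": [2.0, 0.7, 0.1, 0.4, 1.5],
})


def top_n_per_target(net, n):
    """Keep the n highest-importance regulators for each target gene."""
    ranked = net.sort_values("importance", ascending=False)
    return ranked.groupby("target", sort=False).head(n).reset_index(drop=True)


def above_threshold(net, threshold):
    """Keep links whose importance exceeds a fixed cutoff."""
    return net[net["importance"] > threshold].reset_index(drop=True)


top2 = top_n_per_target(network, 2)
strong = above_threshold(network, 0.5)
print(top2)
print(strong)
```

Top-N filtering keeps every target in the result, whereas a global threshold can drop targets whose regulators all score low, so the two strategies can give noticeably different networks.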
Integration with pySCENIC
Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:
```python
# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2
network = grnboost2(expression_data=sc_data, tf_names=tf_list)

# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)
```

Reproducibility
Always set a seed for reproducible results:
```python
network = grnboost2(expression_data=matrix, seed=777)
```

Run multiple seeds for robustness analysis:

```python
from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

if __name__ == '__main__':
    client = Client(LocalCluster())

    seeds = [42, 123, 777]
    networks = []
    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    # Combine networks and filter consensus links
    # (analyze_consensus is a placeholder not defined in this document; supply your own helper)
    consensus = analyze_consensus(networks)
```
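The `analyze_consensus` call is not defined anywhere in this document. One possible stand-in, sketched here under the name `consensus_links`, keeps links that appear in a minimum fraction of runs and averages their importance. The function name, the `min_fraction` parameter, and the toy runs are all hypothetical, and the sketch assumes each link appears at most once per network.

```python
import pandas as pd


def consensus_links(networks, min_fraction=1.0):
    """Keep links found in at least min_fraction of runs; average importance."""
    stacked = pd.concat(networks, ignore_index=True)
    summary = stacked.groupby(["TF", "target"], as_index=False).agg(
        mean_importance=("importance", "mean"),
        n_runs=("importance", "count"),
    )
    # A link must appear in enough of the runs to be kept.
    keep = summary["n_runs"] >= min_fraction * len(networks)
    return (
        summary[keep]
        .sort_values("mean_importance", ascending=False)
        .reset_index(drop=True)
    )


# Two toy runs standing in for networks inferred with different seeds.
run_a = pd.DataFrame(
    {"TF": ["TF1", "TF2"], "target": ["G1", "G1"], "importance": [2.0, 0.5]}
)
run_b = pd.DataFrame(
    {"TF": ["TF1", "TF3"], "target": ["G1", "G1"], "importance": [1.0, 0.4]}
)

consensus = consensus_links([run_a, run_b])
print(consensus)
```

With the default `min_fraction=1.0` only links present in every run survive; lowering it trades stricter reproducibility for a larger network.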
Troubleshooting
Memory errors: Reduce dataset size by filtering low-variance genes or use distributed computing
Slow performance: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list
Dask errors: Ensure the `if __name__ == '__main__':` guard is present in scripts
Empty results: Check data format (genes as columns), verify TF names match gene names
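For the memory-error case, filtering low-variance genes might look like the sketch below. `keep_most_variable_genes`, its `always_keep` parameter, and the toy matrix are hypothetical; the idea is to keep the most variable gene columns while retaining any TFs of interest so they remain available as candidate regulators.

```python
import pandas as pd


def keep_most_variable_genes(expression, n_genes, always_keep=()):
    """Keep the n_genes highest-variance columns plus any listed in always_keep."""
    variances = expression.var(axis=0)
    top = variances.sort_values(ascending=False).head(n_genes).index
    keep = set(top) | (set(always_keep) & set(expression.columns))
    # Preserve the original column order.
    return expression.loc[:, [gene for gene in expression.columns if gene in keep]]


# Toy matrix: observations as rows, genes as columns.
toy = pd.DataFrame({
    "GeneA": [1.0, 5.0, 9.0],  # high variance
    "GeneB": [2.0, 2.0, 2.0],  # constant
    "GeneC": [1.0, 2.0, 3.0],  # moderate variance
    "TF1": [3.0, 3.1, 3.0],    # low variance, but a TF we want to retain
})

reduced = keep_most_variable_genes(toy, n_genes=2, always_keep=["TF1"])
print(list(reduced.columns))
```

Passing the TF list as `always_keep` also guards against the empty-results case in which filtering would otherwise remove every TF column from the matrix.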