arboreto

Arboreto

Overview

Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.
Core capability: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).

Quick Start

Install arboreto:
bash
uv pip install arboreto
Basic GRN inference:
python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (genes as columns)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results (TF, target, importance)
    network.to_csv('network.tsv', sep='\t', index=False, header=False)
Critical: Always use the `if __name__ == '__main__':` guard, because Dask spawns new processes.

Core Capabilities

1. Basic GRN Inference

For standard GRN inference workflows including:
  • Input data preparation (Pandas DataFrame or NumPy array)
  • Running inference with GRNBoost2 or GENIE3
  • Filtering by transcription factors
  • Output format and interpretation
See: `references/basic_inference.md`
Use the ready-to-run script `scripts/basic_grn_inference.py` for standard inference tasks:
bash
python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
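For the input preparation step, the examples in this document expect observations as rows and genes as columns. The sketch below shows one way to reorient a genes-by-samples table and to derive the NumPy form. The table contents and the names `genes_by_samples`, `expression_array`, and `gene_names` are illustrative assumptions, not part of the original scripts.

```python
import pandas as pd

# Hypothetical table with genes as rows and samples as columns
genes_by_samples = pd.DataFrame(
    [[1.0, 2.0, 3.0], [0.0, 4.0, 8.0]],
    index=["GENE_A", "GENE_B"],
    columns=["sample_1", "sample_2", "sample_3"],
)

# Transpose so that rows are observations and columns are genes
expression_matrix = genes_by_samples.T

# NumPy form: a 2D array plus a separate list of gene names,
# since a bare array carries no column labels
expression_array = expression_matrix.to_numpy()
gene_names = list(expression_matrix.columns)
```

When a plain array is used instead of a DataFrame, the gene names likely need to be passed alongside it; check `references/basic_inference.md` for the exact argument.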

2. Algorithm Selection

Arboreto provides two algorithms:
GRNBoost2 (Recommended):
  • Fast gradient boosting-based inference
  • Optimized for large datasets (10k+ observations)
  • Default choice for most analyses
GENIE3:
  • Random Forest-based inference
  • Original multiple regression approach
  • Use for comparison or validation
Quick comparison:
python
from arboreto.algo import grnboost2, genie3

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)
**For detailed algorithm comparison, parameters, and selection guidance**: `references/algorithms.md`

3. Distributed Computing

Scale inference from local multi-core to cluster environments:
Local (default) - Uses all available cores automatically:
python
network = grnboost2(expression_data=matrix)
Custom local client - Control resources:
python
from distributed import LocalCluster, Client

local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

client.close()
local_cluster.close()
Cluster computing - Connect to remote Dask scheduler:
python
from distributed import Client

client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)
For cluster setup, performance optimization, and large-scale workflows: `references/distributed_computing.md`

Installation

bash
uv pip install arboreto
Dependencies: scipy, scikit-learn, numpy, pandas, dask, distributed

Common Use Cases

Single-Cell RNA-seq Analysis

python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer cell-type-specific regulatory network
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter high-confidence links
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)

Bulk RNA-seq with TF Filtering

python
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )

    network.to_csv('tf_target_network.tsv', sep='\t', index=False)

Comparative Analysis (Multiple Conditions)

python
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)

Output Interpretation

Arboreto returns a DataFrame with regulatory links:
| Column | Description |
|---|---|
| TF | Transcription factor (regulator) |
| target | Target gene |
| importance | Regulatory importance score (higher = stronger) |
Filtering strategy:
  • Top N links per target gene
  • Importance threshold (e.g., > 0.5)
  • Statistical significance testing (permutation tests)
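The first two filtering strategies can be sketched with plain pandas operations on a DataFrame that has the `TF`, `target`, and `importance` columns. The helper names `top_n_per_target` and `above_threshold` and the toy values are assumptions made for illustration. A sensible threshold depends on the dataset, so 0.5 here is only an example.

```python
import pandas as pd

def top_n_per_target(network: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the n highest-importance regulators for each target gene."""
    ranked = network.sort_values("importance", ascending=False)
    return ranked.groupby("target", sort=False).head(n).reset_index(drop=True)

def above_threshold(network: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep links whose importance is strictly greater than the threshold."""
    return network[network["importance"] > threshold].reset_index(drop=True)

# Toy network with the same three columns (values are made up)
toy_network = pd.DataFrame({
    "TF": ["TF1", "TF2", "TF3", "TF1", "TF2"],
    "target": ["G1", "G1", "G1", "G2", "G2"],
    "importance": [0.9, 0.2, 0.6, 0.4, 0.7],
})

top_links = top_n_per_target(toy_network, n=2)
strong_links = above_threshold(toy_network, threshold=0.5)
```

Per-target ranking keeps weakly regulated targets in the network, while a global threshold tends to keep only the targets with strong regulators. Which one fits depends on the downstream analysis.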

Integration with pySCENIC

Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:
python
# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2
network = grnboost2(expression_data=sc_data, tf_names=tf_list)

# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)

Reproducibility

Always set a seed for reproducible results:
python
network = grnboost2(expression_data=matrix, seed=777)
Run multiple seeds for robustness analysis:
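python
from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

if __name__ == '__main__':
    # `matrix` is the expression DataFrame loaded beforehand (observations x genes)
    client = Client(LocalCluster())

    seeds = [42, 123, 777]
    networks = []

    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    # Combine networks and filter consensus links.
    # analyze_consensus is a user-defined placeholder, not an arboreto function.
    consensus = analyze_consensus(networks)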
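The `analyze_consensus` call in the example is not defined anywhere in this document. One possible implementation, given here as a hypothetical sketch, keeps the links that appear in at least a chosen fraction of the runs and averages their importance. The `min_fraction` parameter and the output column names are assumptions, and each (TF, target) pair is assumed to appear at most once per run.

```python
import pandas as pd

def analyze_consensus(networks, min_fraction=1.0):
    """Hypothetical helper: keep links present in at least min_fraction of runs.

    Each network is a DataFrame with TF, target, and importance columns.
    Returns one row per retained link with its mean importance and run count.
    """
    combined = pd.concat(networks, ignore_index=True)
    summary = combined.groupby(["TF", "target"], as_index=False).agg(
        mean_importance=("importance", "mean"),
        n_runs=("importance", "count"),
    )
    required = min_fraction * len(networks)
    kept = summary[summary["n_runs"] >= required]
    return kept.sort_values("mean_importance", ascending=False).reset_index(drop=True)

# Two toy runs standing in for networks inferred with different seeds
run_a = pd.DataFrame({"TF": ["TF1", "TF2"], "target": ["G1", "G1"], "importance": [1.0, 0.5]})
run_b = pd.DataFrame({"TF": ["TF1", "TF3"], "target": ["G1", "G2"], "importance": [3.0, 0.8]})

consensus = analyze_consensus([run_a, run_b])
relaxed = analyze_consensus([run_a, run_b], min_fraction=0.5)
```

Requiring a link in every run (`min_fraction=1.0`) is the strictest choice. Lowering the fraction trades stability for coverage.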

Troubleshooting

Memory errors: Reduce dataset size by filtering low-variance genes or use distributed computing
Slow performance: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list
Dask errors: Ensure the `if __name__ == '__main__':` guard is present in scripts
Empty results: Check data format (genes as columns), verify TF names match gene names
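Two of the checks above can be run with pandas before starting inference: confirming that the TF names occur among the gene columns, and dropping low-variance genes to shrink the matrix. The helper names `check_tf_overlap` and `drop_low_variance_genes`, the variance cutoff, and the toy data are assumptions for illustration only.

```python
import pandas as pd

def check_tf_overlap(expression: pd.DataFrame, tf_names):
    """Split TF names into those found among the gene columns and those missing."""
    genes = set(expression.columns)
    found = [tf for tf in tf_names if tf in genes]
    missing = [tf for tf in tf_names if tf not in genes]
    return found, missing

def drop_low_variance_genes(expression: pd.DataFrame, min_variance: float) -> pd.DataFrame:
    """Keep gene columns whose variance across observations exceeds min_variance."""
    variances = expression.var(axis=0)
    return expression.loc[:, variances > min_variance]

# Toy matrix: rows are observations, columns are genes (values are made up)
expression = pd.DataFrame({
    "GATA1": [1.0, 5.0, 9.0],
    "ACTB": [4.0, 4.0, 4.0],
    "SPI1": [0.0, 2.0, 1.0],
})

# Matching is case-sensitive, so "spi1" does not match the "SPI1" column
found, missing = check_tf_overlap(expression, ["GATA1", "spi1", "TAL1"])
filtered = drop_low_variance_genes(expression, min_variance=0.0)
```

A long `missing` list usually points to a naming mismatch (case, gene symbols versus identifiers) or to a matrix that still has genes as rows.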