Arboreto

Overview

Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.

Core capability: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).

Quick Start

Install arboreto:

bash

uv pip install arboreto

Basic GRN inference:

python

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (genes as columns)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results (TF, target, importance)
    network.to_csv('network.tsv', sep='\t', index=False, header=False)

Critical: Always use an `if __name__ == '__main__':` guard because Dask spawns new processes.
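The Quick Start example writes `network.tsv` without a header row, so the column names have to be supplied again when the file is read back. A minimal sketch of that step, assuming the three-column layout (TF, target, importance) noted in the code comment above; `load_network` is a hypothetical helper, not part of arboreto, and the in-memory rows are made-up values:

```python
import io

import pandas as pd

# Column names are supplied by hand because the file has no header row.
NETWORK_COLUMNS = ['TF', 'target', 'importance']


def load_network(path_or_buffer):
    """Read a header-less, tab-separated network file into a DataFrame.

    Hypothetical helper for illustration; not part of arboreto.
    """
    return pd.read_csv(path_or_buffer, sep='\t', header=None, names=NETWORK_COLUMNS)


# In-memory stand-in for network.tsv (made-up values, for illustration only).
example_file = io.StringIO("TF1\tGENE_A\t12.5\nTF2\tGENE_A\t3.1\nTF1\tGENE_B\t0.7\n")
network = load_network(example_file)

# Strongest links first.
print(network.sort_values('importance', ascending=False))
```

Passing a real path such as `'network.tsv'` instead of the `StringIO` buffer should work the same way.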

Core Capabilities

1. Basic GRN Inference

For standard GRN inference workflows including:

Input data preparation (Pandas DataFrame or NumPy array)
Running inference with GRNBoost2 or GENIE3
Filtering by transcription factors
Output format and interpretation

See `references/basic_inference.md`.

Use the ready-to-run script `scripts/basic_grn_inference.py` for standard inference tasks:

bash

python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
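The input expectations listed above (observations as rows, genes as columns, TF names that match gene names) can be checked with plain pandas before any inference is run. The following is a minimal sketch under those assumptions; `prepare_inputs` and its `genes_in_rows` flag are hypothetical names introduced here, not part of arboreto, and the data is made up:

```python
import pandas as pd


def prepare_inputs(expression, tf_names, genes_in_rows=False):
    """Return (matrix, usable_tfs) with observations as rows and genes as columns.

    Hypothetical helper for illustration; not part of arboreto.
    """
    # Transpose when the table arrives as genes x observations.
    matrix = expression.T if genes_in_rows else expression
    gene_set = set(matrix.columns)
    # Keep only TFs that actually appear among the gene columns.
    usable_tfs = [tf for tf in tf_names if tf in gene_set]
    return matrix, usable_tfs


# Made-up genes x samples table, for illustration only.
raw = pd.DataFrame(
    {'s1': [1.0, 0.0, 3.0], 's2': [2.0, 0.5, 1.0]},
    index=['TF1', 'GENE_A', 'GENE_B'],
)
matrix, tfs = prepare_inputs(raw, ['TF1', 'TF_MISSING'], genes_in_rows=True)
print(matrix.shape)  # 2 observations, 3 genes
print(tfs)           # TF_MISSING is dropped because it is not a gene column
```

The resulting `matrix` and `tfs` are what one would then pass as `expression_data` and `tf_names`.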

2. Algorithm Selection

Arboreto provides two algorithms:

GRNBoost2 (Recommended):

Fast gradient boosting-based inference
Optimized for large datasets (10k+ observations)
Default choice for most analyses

GENIE3:

Random Forest-based inference
Original multiple regression approach
Use for comparison or validation

Quick comparison:

python

from arboreto.algo import grnboost2, genie3

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)


**For detailed algorithm comparison, parameters, and selection guidance**: `references/algorithms.md`

3. Distributed Computing

Scale inference from local multi-core to cluster environments:

Local (default) - Uses all available cores automatically:

python

network = grnboost2(expression_data=matrix)

Custom local client - Control resources:

python

from distributed import LocalCluster, Client

local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

client.close()
local_cluster.close()

Cluster computing - Connect to remote Dask scheduler:

python

from distributed import Client

client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)

For cluster setup, performance optimization, and large-scale workflows, see `references/distributed_computing.md`.

Installation

bash

uv pip install arboreto

Dependencies: scipy, scikit-learn, numpy, pandas, dask, distributed

Common Use Cases

Single-Cell RNA-seq Analysis

python

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer cell-type-specific regulatory network
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter high-confidence links
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)

Bulk RNA-seq with TF Filtering

python

import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )

    network.to_csv('tf_target_network.tsv', sep='\t', index=False)

Comparative Analysis (Multiple Conditions)

python

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)

Output Interpretation

Arboreto returns a DataFrame with regulatory links:

| Column | Description |
|---|---|
| `TF` | Transcription factor (regulator) |
| `target` | Target gene |
| `importance` | Regulatory importance score (higher = stronger) |

Filtering strategy:

Top N links per target gene
Importance threshold (e.g., > 0.5)
Statistical significance testing (permutation tests)
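The first two filtering strategies above can be expressed directly in pandas. This is a minimal sketch assuming the `TF` / `target` / `importance` columns described in the table; the helper names `top_n_per_target` and `above_threshold` are hypothetical, the scores are made up, and the permutation-test strategy is not covered here:

```python
import pandas as pd


def top_n_per_target(network, n):
    """Keep the n highest-importance links for each target gene."""
    ranked = network.sort_values('importance', ascending=False)
    # head(n) on the grouped frame keeps the first n rows of each target,
    # which are the strongest ones because of the sort above.
    return ranked.groupby('target', sort=False).head(n).reset_index(drop=True)


def above_threshold(network, threshold):
    """Keep links whose importance is strictly greater than threshold."""
    return network[network['importance'] > threshold].reset_index(drop=True)


# Made-up network, for illustration only.
network = pd.DataFrame({
    'TF': ['TF1', 'TF2', 'TF3', 'TF1', 'TF2'],
    'target': ['GENE_A', 'GENE_A', 'GENE_A', 'GENE_B', 'GENE_B'],
    'importance': [5.0, 2.0, 0.1, 0.4, 3.0],
})

top2 = top_n_per_target(network, n=2)
strong = above_threshold(network, threshold=0.5)
print(top2)
print(strong)
```

A suitable threshold is likely to depend on the dataset, so the `0.5` used here simply mirrors the example value in the list above.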

Integration with pySCENIC

Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:

python

# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2
network = grnboost2(expression_data=sc_data, tf_names=tf_list)

# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)

Reproducibility

Always set a seed for reproducible results:

python

network = grnboost2(expression_data=matrix, seed=777)

Run multiple seeds for robustness analysis:

python

from arboreto.algo import grnboost2
from distributed import LocalCluster, Client

if __name__ == '__main__':
    client = Client(LocalCluster())

    seeds = [42, 123, 777]
    networks = []

    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    # Combine networks and filter consensus links
    # (analyze_consensus is a user-defined placeholder, not part of arboreto)
    consensus = analyze_consensus(networks)

    client.close()
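The source does not define `analyze_consensus`, so what counts as a consensus link is left open. One possible definition, shown here purely as an assumption, keeps links that appear in at least a given fraction of the runs and averages their importance. The `min_fraction` parameter and the output column names are hypothetical, and the sketch assumes each run lists a given TF-target pair at most once:

```python
import pandas as pd


def analyze_consensus(networks, min_fraction=1.0):
    """Keep links found in at least min_fraction of the runs; average their importance.

    One possible definition of consensus, for illustration; not part of arboreto.
    """
    combined = pd.concat(networks, ignore_index=True)
    summary = combined.groupby(['TF', 'target'], as_index=False).agg(
        mean_importance=('importance', 'mean'),
        n_runs=('importance', 'count'),
    )
    # A link must appear in enough runs to be kept.
    keep = summary['n_runs'] >= min_fraction * len(networks)
    return (
        summary[keep]
        .sort_values('mean_importance', ascending=False)
        .reset_index(drop=True)
    )


# Two made-up runs, for illustration only.
run_a = pd.DataFrame({'TF': ['TF1', 'TF2'], 'target': ['GENE_A', 'GENE_A'],
                      'importance': [4.0, 1.0]})
run_b = pd.DataFrame({'TF': ['TF1', 'TF3'], 'target': ['GENE_A', 'GENE_A'],
                      'importance': [6.0, 2.0]})

consensus = analyze_consensus([run_a, run_b])
print(consensus)  # only TF1 -> GENE_A appears in both runs
```

Lowering `min_fraction` (for example to `0.5`) would also keep links that appear in only some of the runs.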

Troubleshooting

Memory errors: Reduce dataset size by filtering low-variance genes or use distributed computing

Slow performance: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list

Dask errors: Ensure the `if __name__ == '__main__':` guard is present in scripts

Empty results: Check data format (genes as columns), verify TF names match gene names
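For the memory-error case above, filtering low-variance genes can be done in pandas before the matrix is handed to arboreto. A minimal sketch, assuming observations as rows and genes as columns; `drop_low_variance_genes`, its `keep_top` default of 2000, and the `always_keep` option (for retaining TFs regardless of variance) are assumptions made for illustration, not arboreto features:

```python
import pandas as pd


def drop_low_variance_genes(matrix, keep_top=2000, always_keep=()):
    """Keep the keep_top most variable gene columns plus any names in always_keep.

    Hypothetical helper for illustration; not part of arboreto.
    """
    variances = matrix.var(axis=0)
    top = variances.sort_values(ascending=False).head(keep_top).index
    keep = set(top) | (set(always_keep) & set(matrix.columns))
    # Preserve the original column order.
    return matrix.loc[:, [gene for gene in matrix.columns if gene in keep]]


# Made-up observations x genes matrix, for illustration only.
matrix = pd.DataFrame({
    'TF1': [1.0, 1.1, 0.9, 1.0],
    'GENE_A': [0.0, 5.0, 10.0, 2.0],
    'GENE_B': [3.0, 3.0, 3.0, 3.0],
    'GENE_C': [1.0, 4.0, 2.0, 8.0],
})

reduced = drop_low_variance_genes(matrix, keep_top=2, always_keep=['TF1'])
print(list(reduced.columns))
```

Here the constant `GENE_B` column is dropped, while `TF1` survives only because it is listed in `always_keep`.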
