Arboreto

Name: Arboreto
Author: k-dense-ai

k-dense-ai/scientific-agent-skills

840 installs
32k repo stars
Updated July 29, 2026

Arboreto is a Claude Code skill that infers gene regulatory networks from expression matrices using GRNBoost2 or GENIE3 for developers who need regulator-target predictions in computational biology pipelines.

About

Arboreto is a scientific-agent skill for gene regulatory network (GRN) inference from expression matrices inside terminal-based agent workflows. The skill documents two multiple-regression algorithms, GRNBoost2 (gradient boosting, recommended for large datasets) and GENIE3 (random forests), that train a model per target gene, rank feature importance, and emit candidate regulators with scores. Developers reach for Arboreto when building single-cell or bulk RNA-seq analysis pipelines that need ranked regulator hypotheses rather than manual feature selection. GRNBoost2 targets datasets with tens of thousands of observations, where GENIE3 would be slower. The skill sits in k-dense-ai/scientific-agent-skills alongside other bioinformatics agent tools.

Documents two GRN algorithms—GRNBoost2 (gradient boosting, default for large data) and GENIE3 (random forests)—with the
Recommends GRNBoost2 for large-scale and time-constrained work (e.g. single-cell RNA-seq with tens of thousands of obser
Shared three-step strategy: per-target regression, important-feature regulators, scored TF–target–importance triplets
Includes `grnboost2()` signature and parameters (`expression_data`, `tf_names`, `seed`, `limit`, sparse matrix support)
Explains when to pick GENIE3 vs GRNBoost2 based on dataset size and runtime constraints

Arboreto by the numbers

840 all-time installs (skills.sh)
+40 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #352 of 2,065 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 29, 2026 (Skillselion catalog sync)

npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill arboreto

How do you infer gene regulatory networks from expression data?

Infer gene regulatory networks from expression matrices using GRNBoost2 or GENIE3 when you are doing computational biology or single-cell analysis from the terminal with an agent.

Who is it for?

Developers running computational biology or single-cell RNA-seq pipelines who need automated GRN inference from expression matrices in an agent terminal session.

Skip if: You have no expression matrix to analyze, or you need protein-protein interaction networks rather than transcriptional regulation inference.

When should I use this skill?

A developer asks to infer regulators, build a GRN, or compare GRNBoost2 versus GENIE3 on scRNA-seq or bulk expression data.

What you get

Per-target-gene ranked regulator lists with importance scores from GRNBoost2 or GENIE3 inference.

Ranked regulator lists per target gene
Regulator importance scores

By the numbers

Documents 2 GRN inference algorithms: GRNBoost2 and GENIE3
Uses a 3-step per-target-gene inference strategy: train, rank features, emit regulators

Files

SKILL.md (Markdown)

Arboreto

Overview

Arboreto is a Python library from Aerts Lab for inferring gene regulatory networks (GRNs) from gene expression data. It parallelizes tree-based ensemble regression (GRNBoost2, GENIE3) with Dask across local cores or remote clusters.

Core capability: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).

Upstream: PyPI 0.1.6 (2021-02-09, latest). Docs: arboreto.readthedocs.io. Primary downstream consumer: pySCENIC.

Quick Start

Install arboreto:

uv pip install arboreto

Basic GRN inference:

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (genes as columns)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results (TF, target, importance)
    network.to_csv('network.tsv', sep='\t', index=False, header=False)

Critical: Always wrap the entry point in an if __name__ == '__main__': guard, because Dask spawns new processes.

Core Capabilities

1. Basic GRN Inference

For standard GRN inference workflows including:

Input data preparation (Pandas DataFrame or NumPy array)
Running inference with GRNBoost2 or GENIE3
Filtering by transcription factors
Output format and interpretation

See: references/basic_inference.md

Use the ready-to-run script: scripts/basic_grn_inference.py for standard inference tasks:

python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777 --limit 5000

2. Algorithm Selection

Arboreto provides two algorithms:

GRNBoost2 (Recommended):

Fast gradient boosting-based inference
Optimized for large datasets (10k+ observations)
Default choice for most analyses

GENIE3:

Random Forest-based inference
Original multiple regression approach
Use for comparison or validation

Quick comparison:

from arboreto.algo import grnboost2, genie3

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)

For detailed algorithm comparison, parameters, and selection guidance: references/algorithms.md

3. Distributed Computing

Scale inference from local multi-core to cluster environments:

Local (default) - Uses all available cores automatically:

network = grnboost2(expression_data=matrix)

Custom local client - Control resources:

from distributed import LocalCluster, Client

local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

client.close()
local_cluster.close()

Cluster computing - Connect to remote Dask scheduler:

from distributed import Client

client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)

For cluster setup, performance optimization, and large-scale workflows: references/distributed_computing.md

Installation

uv pip install arboreto

Conda (Bioconda):

conda install -c bioconda arboreto

Dependencies (from upstream requirements.txt): dask[complete], distributed, numpy, pandas, scikit-learn, scipy

Input formats: pandas DataFrame, dense numpy.ndarray, or sparse scipy.sparse.csc_matrix (rows = observations, columns = genes). For array/matrix inputs, pass gene_names explicitly.

Common Use Cases

Single-Cell RNA-seq Analysis

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer cell-type-specific regulatory network
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter links above an illustrative importance cutoff
    # (scores are not confined to 0-1, so tune the threshold per dataset)
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)

Bulk RNA-seq with TF Filtering

import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )

    network.to_csv('tf_target_network.tsv', sep='\t', index=False)

Comparative Analysis (Multiple Conditions)

import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
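Once per-condition networks exist, one way to compare them is to label each TF-target link by the condition or conditions in which it appears. The sketch below is only an illustration: it uses tiny hand-made DataFrames in place of real grnboost2 output, and the helper name `compare_links` and all values are assumptions, not part of arboreto.

```python
import pandas as pd


def compare_links(net_a, net_b, label_a='a', label_b='b'):
    """Outer-join two networks on (TF, target) and tag where each link occurs."""
    merged = net_a.merge(
        net_b,
        on=['TF', 'target'],
        how='outer',
        suffixes=(f'_{label_a}', f'_{label_b}'),
        indicator=True,
    )
    # Map the pandas merge indicator onto the condition labels
    merged['present_in'] = merged['_merge'].astype(str).map(
        {'left_only': label_a, 'right_only': label_b, 'both': 'both'}
    )
    return merged.drop(columns='_merge')


# Toy networks standing in for grnboost2 output (values are made up)
control = pd.DataFrame({
    'TF': ['TF1', 'TF2'],
    'target': ['gene5', 'gene8'],
    'importance': [3.2, 1.1],
})
treated = pd.DataFrame({
    'TF': ['TF1', 'TF3'],
    'target': ['gene5', 'gene9'],
    'importance': [2.7, 4.0],
})

comparison = compare_links(control, treated, 'control', 'treatment_24h')
print(comparison)
```

Links tagged with a single condition are candidates for condition-specific regulation, although differences between single runs can also reflect run-to-run variability.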

Output Interpretation

Arboreto returns a DataFrame with regulatory links:

Column	Description
`TF`	Transcription factor (regulator)
`target`	Target gene
`importance`	Regulatory importance score (higher = stronger)

Filtering strategy:

limit=N at inference time (return top N links globally)
Post-hoc importance threshold (e.g., > 0.5)
Top links per target via groupby('target')
Statistical significance testing (permutation tests, external tools)
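As a rough illustration of the per-target option in the list above, the following sketch keeps the strongest links for each target gene. The toy DataFrame stands in for real grnboost2 output, and `top_k = 2` is an arbitrary value chosen for the example.

```python
import pandas as pd

# Toy network standing in for grnboost2 output (values are made up)
network = pd.DataFrame({
    'TF': ['TF1', 'TF2', 'TF3', 'TF1', 'TF2'],
    'target': ['geneA', 'geneA', 'geneA', 'geneB', 'geneB'],
    'importance': [5.0, 1.0, 3.0, 0.2, 0.9],
})

top_k = 2  # illustrative choice; tune per dataset
top_per_target = (
    network.sort_values('importance', ascending=False)
    .groupby('target', sort=False)
    .head(top_k)  # keep the top_k highest-importance rows within each target
    .sort_values(['target', 'importance'], ascending=[True, False])
    .reset_index(drop=True)
)
print(top_per_target)
```

Unlike a global `limit`, this keeps weakly regulated targets represented, at the cost of admitting some low-importance links.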

Integration with pySCENIC

Arboreto powers the GRN inference step in pySCENIC. pySCENIC 0.11+ passes sparse expression matrices to grnboost2 / genie3; pySCENIC 0.12+ defaults to arboreto_with_multiprocessing.py (no Dask) for compatibility — use standalone arboreto when you need Dask scaling.

# Standalone: infer co-expression modules before pySCENIC cisTarget pruning
from arboreto.algo import grnboost2

network = grnboost2(expression_data=expression_df, tf_names=tf_list, limit=5000)

# Downstream: pySCENIC ctx pruning, regulon definition, AUCell (see pySCENIC docs)

Convert AnnData to a DataFrame for arboreto directly:

expression_df = adata.to_df()  # cells x genes

Reproducibility

Always set a seed for reproducible results:

network = grnboost2(expression_data=matrix, seed=777)

Run multiple seeds for robustness analysis:

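import pandas as pd
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # matrix: expression DataFrame (rows = observations, columns = genes)
    cluster = LocalCluster()
    client = Client(cluster)

    seeds = [42, 123, 777]
    networks = []

    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    # Release Dask resources once all runs are done
    client.close()
    cluster.close()

    # Consensus: links recurring across runs (example: mean importance per TF-target pair)
    # The 0.5 cutoff is illustrative; importance scores are dataset-dependent
    combined = pd.concat(networks)
    consensus = (
        combined.groupby(['TF', 'target'], as_index=False)['importance']
        .mean()
        .query('importance > 0.5')
    )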
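Averaging importance does not by itself show how often a link recurs. A complementary summary, sketched below on toy data, counts the number of runs in which each TF-target pair appears. The DataFrames stand in for per-seed grnboost2 output, and `min_runs` is a hypothetical cutoff, not an arboreto parameter. The sketch assumes each run lists a given link at most once.

```python
import pandas as pd

# Toy per-seed networks standing in for grnboost2 output (values are made up)
networks = [
    pd.DataFrame({'TF': ['TF1', 'TF2'], 'target': ['g1', 'g2'], 'importance': [2.0, 1.0]}),
    pd.DataFrame({'TF': ['TF1', 'TF3'], 'target': ['g1', 'g4'], 'importance': [4.0, 0.5]}),
    pd.DataFrame({'TF': ['TF1', 'TF2'], 'target': ['g1', 'g2'], 'importance': [3.0, 2.0]}),
]

combined = pd.concat(networks, ignore_index=True)
recurrence = (
    combined.groupby(['TF', 'target'], as_index=False)
    .agg(n_runs=('importance', 'count'), mean_importance=('importance', 'mean'))
)

min_runs = 2  # hypothetical cutoff: keep links seen in at least 2 of the 3 runs
stable = recurrence[recurrence['n_runs'] >= min_runs].reset_index(drop=True)
print(stable)
```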

Troubleshooting

Memory errors: Reduce dataset size by filtering low-variance genes or use distributed computing

Slow performance: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list

Dask errors: Ensure the if __name__ == '__main__': guard is present in scripts (required on Windows/macOS with spawn-based multiprocessing)

Empty results: Check data format (genes as columns), verify TF names match column names in the expression matrix

Sparse data: Use scipy.sparse.csc_matrix and pass matching gene_names; supported since arboreto 0.1.6 / pySCENIC 0.11
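For the "Empty results" case above, a quick pre-flight check can show how many TF names actually match the expression columns. This is a minimal sketch using only the standard library; `check_tf_overlap` is a hypothetical helper, not an arboreto function, and the gene lists are toy values.

```python
def check_tf_overlap(tf_names, gene_names):
    """Split a TF list into names found in, and missing from, the gene columns."""
    genes = set(gene_names)
    found = [tf for tf in tf_names if tf in genes]
    missing = [tf for tf in tf_names if tf not in genes]
    return found, missing


# Toy inputs; in practice gene_names would be list(expression_matrix.columns)
gene_names = ['GATA1', 'SPI1', 'ACTB', 'GAPDH']
tf_names = ['GATA1', 'Spi1', 'TAL1']  # note the case mismatch on 'Spi1'

found, missing = check_tf_overlap(tf_names, gene_names)
print(f'{len(found)} of {len(tf_names)} TFs match expression columns')
print('Missing:', missing)
```

Case differences and mixed identifier schemes (symbols versus database IDs) are common reasons for a TF list that matches nothing.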

GRN Inference Algorithms

Arboreto provides two high-level algorithms for gene regulatory network (GRN) inference, both based on the multiple regression approach.

Algorithm Overview

Both algorithms follow the same inference strategy:

1. For each target gene in the dataset, train a regression model
2. Identify the most important features (potential regulators) from the model
3. Emit these features as candidate regulators with importance scores

The key difference is computational efficiency and the underlying regression method.
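To make the three-step strategy concrete, here is a toy sketch of the per-target loop written directly against scikit-learn. It is not arboreto's implementation: `infer_links` is a hypothetical helper, the data are synthetic, and the importances are scikit-learn's per-model normalized values rather than arboreto's scores.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def infer_links(expression, tf_names, seed=0):
    """Toy per-target regression loop: one model per target gene."""
    links = []
    for target in expression.columns:
        # Step 1: regress the target on all candidate regulators except itself
        regulators = [tf for tf in tf_names if tf != target]
        if not regulators:
            continue
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(expression[regulators].to_numpy(), expression[target].to_numpy())
        # Steps 2-3: read feature importances and emit (TF, target, importance)
        for tf, importance in zip(regulators, model.feature_importances_):
            links.append((tf, target, importance))
    network = pd.DataFrame(links, columns=['TF', 'target', 'importance'])
    return network.sort_values('importance', ascending=False).reset_index(drop=True)


# Synthetic data: g3 is driven by g1, the other genes are independent noise
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(80, 4)), columns=['g1', 'g2', 'g3', 'g4'])
expr['g3'] = 2.0 * expr['g1'] + rng.normal(scale=0.1, size=80)

toy_network = infer_links(expr, tf_names=['g1', 'g2'])
print(toy_network.head())
```

In this toy setup the g1 to g3 link should score well above g2 to g3, since g3 was constructed from g1.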

GRNBoost2 (Recommended)

Purpose: Fast GRN inference for large-scale datasets using gradient boosting.

When to Use

Large datasets: Tens of thousands of observations (e.g., single-cell RNA-seq)
Time-constrained analysis: Need faster results than GENIE3
Default choice: GRNBoost2 is the flagship algorithm and recommended for most use cases

Technical Details

Method: Stochastic gradient boosting with early-stopping regularization
Performance: Significantly faster than GENIE3 on large datasets
Output: Same format as GENIE3 (TF-target-importance triplets)

Usage

from arboreto.algo import grnboost2

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,
    limit=5000,
)

Parameters (`grnboost2`)

grnboost2(
    expression_data,              # DataFrame, ndarray, or scipy.sparse.csc_matrix
    gene_names=None,              # Required for ndarray/sparse inputs
    tf_names='all',                 # TF list, None/'all' → all genes as regulators
    client_or_address='local',      # 'local', scheduler address, or Dask Client
    early_stop_window_length=25,    # Early-stopping window (GRNBoost2 only)
    limit=None,                     # Return top N links globally
    seed=None,                      # Random seed; None = non-deterministic
    verbose=False,
)

GENIE3

Purpose: Classic Random Forest-based GRN inference, serving as the conceptual blueprint.

When to Use

Smaller datasets: When dataset size allows for longer computation
Comparison studies: When comparing with published GENIE3 results
Validation: To validate GRNBoost2 results

Technical Details

Method: Random Forest regression (ExtraTrees available via diy)
Foundation: Original multiple regression GRN inference strategy
Trade-off: More computationally expensive but well-established

Usage

from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,
)

Parameters (`genie3`)

genie3(
    expression_data,
    gene_names=None,
    tf_names='all',
    client_or_address='local',
    limit=None,
    seed=None,
    verbose=False,
)

Algorithm Comparison

Feature	GRNBoost2	GENIE3
Speed	Fast (optimized for large data)	Slower
Method	Gradient boosting (GBM)	Random Forest
Best for	Large-scale data (10k+ observations)	Small-medium datasets
Output format	Same	Same
Inference strategy	Multiple regression	Multiple regression
Recommended	Yes (default choice)	For comparison/validation
Early stopping	Yes (`early_stop_window_length`)	No

Advanced: Custom Regressors with `diy`

For custom scikit-learn regressor settings, use diy() (not grnboost2/genie3 kwargs):

from arboreto.algo import diy
from arboreto.core import SGBM_KWARGS, RF_KWARGS

# Custom GRNBoost2-style run
custom_gbm = diy(
    expression_data=expression_matrix,
    regressor_type='GBM',  # 'RF', 'GBM', or 'ET'
    regressor_kwargs={
        **SGBM_KWARGS,
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1,
    },
    tf_names=tf_names,
    seed=42,
)

# Custom GENIE3-style run
custom_rf = diy(
    expression_data=expression_matrix,
    regressor_type='RF',
    regressor_kwargs={
        **RF_KWARGS,
        'n_estimators': 1000,
        'max_features': 'sqrt',
    },
    tf_names=tf_names,
)

Import default kwargs from arboreto.core and override only the keys you need.

Choosing the Right Algorithm

Decision guide:

1. Start with GRNBoost2: faster and better suited to large single-cell datasets

2. Use GENIE3 if:

Comparing with existing GENIE3 publications
Dataset is small-medium sized
Validating GRNBoost2 results

3. Use `diy()` if you need non-default regressor hyperparameters

Both algorithms produce comparable regulatory networks with the same output format.

Basic GRN Inference with Arboreto

Input Data Requirements

Arboreto requires gene expression data in one of three formats:

Pandas DataFrame (Recommended)

Rows: Observations (cells, samples, conditions)
Columns: Genes (with gene names as column headers)
Format: Numeric expression values

Example:

import pandas as pd

# Load expression matrix with genes as columns
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
# Columns: ['gene1', 'gene2', 'gene3', ...]
# Rows: observation data

NumPy Array

Shape: (observations, genes)
Requirement: Separately provide gene names list matching column order

Example:

import numpy as np

expression_matrix = np.genfromtxt('expression_data.tsv', delimiter='\t', skip_header=1)
with open('expression_data.tsv') as f:
    gene_names = [gene.strip() for gene in f.readline().split('\t')]

assert expression_matrix.shape[1] == len(gene_names)

Sparse CSC Matrix (arboreto 0.1.6+)

Format: scipy.sparse.csc_matrix with shape (observations, genes)
Requirement: Provide gene_names matching column order (same as NumPy)
Use case: Large single-cell matrices; also used by pySCENIC 0.11+ when --sparse is enabled

Example:

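import scipy.sparse as sp
from arboreto.algo import grnboost2

# expression_df: dense DataFrame, cells x genes (assumed to be loaded already)
# tf_names: list of TF names, or 'all' (assumed to be defined already)
gene_names = list(expression_df.columns)
expression_sparse = sp.csc_matrix(expression_df.to_numpy())

network = grnboost2(
    expression_data=expression_sparse,
    gene_names=gene_names,
    tf_names=tf_names,
)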

Transcription Factors (TFs)

Optionally provide a list of transcription factor names to restrict regulatory inference:

from arboreto.utils import load_tf_names

# Load from file (one TF per line)
tf_names = load_tf_names('transcription_factors.txt')

# Or define directly
tf_names = ['TF1', 'TF2', 'TF3']

If tf_names is None or 'all', all gene_names are treated as potential regulators.

Basic Inference Workflow

Using Pandas DataFrame

import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Load transcription factors (optional)
    tf_names = load_tf_names('tf_list.txt')

    # Run GRN inference
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names  # Optional
    )

    # Save results
    network.to_csv('network_output.tsv', sep='\t', index=False, header=False)

Critical: The if __name__ == '__main__': guard is required because Dask spawns new processes internally.

Using NumPy Array

import numpy as np
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression matrix
    expression_matrix = np.genfromtxt('expression_data.tsv', delimiter='\t', skip_header=1)

    # Extract gene names from header
    with open('expression_data.tsv') as f:
        gene_names = [gene.strip() for gene in f.readline().split('\t')]

    # Verify dimensions match
    assert expression_matrix.shape[1] == len(gene_names)

    # Use all genes as candidate regulators (or pass a list of TF names instead)
    tf_names = 'all'

    # Run inference with explicit gene names
    network = grnboost2(
        expression_data=expression_matrix,
        gene_names=gene_names,
        tf_names=tf_names
    )

    network.to_csv('network_output.tsv', sep='\t', index=False, header=False)

Output Format

Arboreto returns a Pandas DataFrame with three columns:

Column	Description
`TF`	Transcription factor (regulator) gene name
`target`	Target gene name
`importance`	Regulatory importance score (higher = stronger regulation)

Example output:

TF1    gene5    0.856
TF2    gene12   0.743
TF1    gene8    0.621
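Because the workflow examples above save the network with `header=False`, the file has no column names when read back. A small sketch of reloading it with explicit names follows; the in-memory `io.StringIO` buffer stands in for a real file path and simply repeats the example rows.

```python
import io

import pandas as pd

# Stand-in for a file written with header=False (same rows as the example output)
saved = io.StringIO('TF1\tgene5\t0.856\nTF2\tgene12\t0.743\nTF1\tgene8\t0.621\n')

network = pd.read_csv(saved, sep='\t', header=None, names=['TF', 'target', 'importance'])
print(network.dtypes)

# All targets predicted for one regulator
print(network[network['TF'] == 'TF1'])
```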

Setting Random Seed

For reproducible results, pass an explicit seed (None uses random seeds per regressor):

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=777
)

Limiting Output Size

Return only the top N regulatory links globally:

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    limit=5000,
)

Algorithm Selection

Use grnboost2() for most cases (faster, handles large datasets):

from arboreto.algo import grnboost2
network = grnboost2(expression_data=expression_matrix)

Use genie3() for comparison or specific requirements:

from arboreto.algo import genie3
network = genie3(expression_data=expression_matrix)

See references/algorithms.md for detailed algorithm comparison.

Distributed Computing with Arboreto

Arboreto leverages Dask for parallelized computation, enabling efficient GRN inference from single-machine multi-core processing to multi-node cluster environments.

Computation Architecture

GRN inference is inherently parallelizable:

Each target gene's regression model can be trained independently
Arboreto represents computation as a Dask task graph
Tasks are distributed across available computational resources
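The independence of per-target tasks can be illustrated without Dask. In the sketch below, a standard-library thread pool stands in for the Dask scheduler and absolute covariance stands in for the tree-based regressor; both substitutions are assumptions made to keep the example dependency-free, and `score_target` is a hypothetical function.

```python
from concurrent.futures import ThreadPoolExecutor


def score_target(target, expression, regulators):
    """Stand-in for one per-target regression task.

    Scores each regulator by absolute covariance with the target; arboreto
    fits a tree-based regressor at this step instead.
    """
    y = expression[target]
    n = len(y)
    mean_y = sum(y) / n
    links = []
    for tf in regulators:
        if tf == target:
            continue  # a gene is never its own regulator
        x = expression[tf]
        mean_x = sum(x) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
        links.append((tf, target, abs(cov)))
    return links


# Toy expression data: gene -> values across four observations
expression = {
    'g1': [1.0, 2.0, 3.0, 4.0],
    'g2': [4.0, 3.0, 2.0, 1.0],
    'g3': [1.0, 1.0, 2.0, 2.0],
}
regulators = ['g1', 'g2']

# Each target is an independent task, so tasks can run in any order on any worker
with ThreadPoolExecutor(max_workers=2) as pool:
    per_target = list(pool.map(lambda t: score_target(t, expression, regulators), expression))

# Concatenate per-target results into one ranked link list
network = sorted(
    (link for links in per_target for link in links),
    key=lambda link: link[2],
    reverse=True,
)
print(network)
```

Because no task reads another task's result, adding workers shortens wall-clock time until data transfer or scheduling overhead dominates.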

Local Multi-Core Processing (Default)

By default, arboreto uses all available CPU cores on the local machine:

from arboreto.algo import grnboost2

# Automatically uses all local cores
network = grnboost2(expression_data=expression_matrix, tf_names=tf_names)

This is sufficient for most use cases and requires no additional configuration.

Custom Local Dask Client

For fine-grained control over local resources, create a custom Dask client:

from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Configure local cluster
    local_cluster = LocalCluster(
        n_workers=10,              # Number of worker processes
        threads_per_worker=1,       # Threads per worker
        memory_limit='8GB'          # Memory limit per worker
    )

    # Create client
    custom_client = Client(local_cluster)

    # Run inference with custom client
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=custom_client
    )

    # Clean up
    custom_client.close()
    local_cluster.close()

Benefits of Custom Client

Resource control: Limit CPU and memory usage
Multiple runs: Reuse same client for different parameter sets
Monitoring: Access Dask dashboard for performance insights

Multiple Inference Runs with Same Client

Reuse a single Dask client for multiple inference runs with different parameters:

from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Initialize client once
    local_cluster = LocalCluster(n_workers=8, threads_per_worker=1)
    client = Client(local_cluster)

    # Run multiple inferences
    network_seed1 = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=client,
        seed=666
    )

    network_seed2 = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=client,
        seed=777
    )

    # Different algorithms with same client
    from arboreto.algo import genie3
    network_genie3 = genie3(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=client
    )

    # Clean up once
    client.close()
    local_cluster.close()

Distributed Cluster Computing

For very large datasets, connect to a remote Dask distributed scheduler running on a cluster:

Step 1: Set Up Dask Scheduler (on cluster head node)

dask-scheduler
# Output: Scheduler at tcp://10.118.224.134:8786

Step 2: Start Dask Workers (on cluster compute nodes)

dask-worker tcp://10.118.224.134:8786

Step 3: Connect from Client

from distributed import Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Connect to remote scheduler
    scheduler_address = 'tcp://10.118.224.134:8786'
    cluster_client = Client(scheduler_address)

    # Run inference on cluster
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,
        client_or_address=cluster_client
    )

    cluster_client.close()

Cluster Configuration Best Practices

Worker configuration:

# --nprocs: worker processes per node (recent Dask releases rename this to --nworkers)
# --nthreads: threads per process
# --memory-limit: memory per process
dask-worker tcp://scheduler:8786 \
    --nprocs 4 \
    --nthreads 1 \
    --memory-limit 16GB

For large-scale inference:

Use more workers with moderate memory rather than fewer workers with large memory
Set threads_per_worker=1 to avoid GIL contention in scikit-learn
Monitor memory usage to prevent workers from being killed

Monitoring and Debugging

Dask Dashboard

Access the Dask dashboard for real-time monitoring:

from distributed import Client

client = Client()  # Prints dashboard URL
# Dashboard available at: http://localhost:8787/status

The dashboard shows:

Task progress: Number of tasks completed/pending
Resource usage: CPU, memory per worker
Task stream: Real-time visualization of computation
Performance: Bottleneck identification

Verbose Output

Enable verbose logging to track inference progress:

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    verbose=True
)

Performance Optimization Tips

1. Data Format

Use Pandas DataFrame when possible: Gene names travel with the data as column headers, so no separate gene_names list is needed
Reduce data size: Filter low-variance genes before inference
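One possible way to filter low-variance genes before inference is sketched below. The helper `filter_low_variance`, the `min_variance` threshold, and the `keep` argument are all illustrative assumptions; a suitable threshold depends on how the data were normalized.

```python
import pandas as pd


def filter_low_variance(expression, min_variance=0.1, keep=()):
    """Drop gene columns whose variance is below min_variance.

    Genes listed in `keep` (for example TFs of interest) are always retained.
    """
    variances = expression.var(axis=0)
    wanted = set(variances[variances >= min_variance].index) | set(keep)
    # Preserve the original column order
    columns = [gene for gene in expression.columns if gene in wanted]
    return expression[columns]


# Toy matrix: rows = observations, columns = genes
expression = pd.DataFrame({
    'g1': [0.0, 1.0, 2.0, 3.0],
    'g2': [1.0, 1.0, 1.0, 1.0],   # constant, zero variance
    'g3': [0.0, 0.1, 0.0, 0.1],   # nearly constant
    'TF1': [2.0, 2.0, 2.1, 2.0],  # low variance but explicitly kept
})

filtered = filter_low_variance(expression, min_variance=0.1, keep=['TF1'])
print(list(filtered.columns))
```

Keeping candidate TFs regardless of variance avoids silently removing regulators from the search space.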

2. Worker Configuration

CPU-bound tasks: Set threads_per_worker=1, increase n_workers
Memory-bound tasks: Increase memory_limit per worker

3. Cluster Setup

Network: Ensure high-bandwidth, low-latency network between nodes
Storage: Use shared filesystem or object storage for large datasets
Scheduling: Allocate dedicated nodes to avoid resource contention

4. Transcription Factor Filtering

Limit TF list: Providing specific TF names reduces computation

# Full search (slow)
network = grnboost2(expression_data=matrix)

# Filtered search (faster)
network = grnboost2(expression_data=matrix, tf_names=known_tfs)

Example: Large-Scale Single-Cell Analysis

Complete workflow for processing single-cell RNA-seq data on a cluster:

from distributed import Client
from arboreto.algo import grnboost2
import pandas as pd

if __name__ == '__main__':
    # Connect to cluster
    client = Client('tcp://cluster-scheduler:8786')

    # Load large single-cell dataset (50,000 cells x 20,000 genes)
    expression_data = pd.read_csv('scrnaseq_data.tsv', sep='\t')

    # Load cell-type-specific TFs
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    # Run distributed inference
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        client_or_address=client,
        verbose=True,
        seed=42
    )

    # Save results
    network.to_csv('grn_results.tsv', sep='\t', index=False)

    client.close()

This approach enables analysis of datasets that would be impractical on a single machine.

#!/usr/bin/env python3
"""
Basic GRN inference example using Arboreto.

This script demonstrates the standard workflow for inferring gene regulatory
networks from expression data using GRNBoost2.

Usage:
    python basic_grn_inference.py <expression_file> <output_file> [--tf-file TF_FILE] [--seed SEED] [--limit LIMIT]

Arguments:
    expression_file: Path to expression matrix (TSV format, genes as columns)
    output_file: Path for output network (TSV format)
    --tf-file: Optional path to transcription factors file (one per line)
    --seed: Random seed for reproducibility (default: 777)
    --limit: Return only the top N regulatory links (optional)
"""

import argparse
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names


def run_grn_inference(expression_file, output_file, tf_file=None, seed=777, limit=None):
    """
    Run GRN inference using GRNBoost2.

    Args:
        expression_file: Path to expression matrix TSV file
        output_file: Path for output network file
        tf_file: Optional path to TF names file
        seed: Random seed for reproducibility
        limit: Optional cap on number of regulatory links returned
    """
    print(f"Loading expression data from {expression_file}...")
    expression_data = pd.read_csv(expression_file, sep='\t')

    print(f"Expression matrix shape: {expression_data.shape}")
    print(f"Number of genes: {expression_data.shape[1]}")
    print(f"Number of observations: {expression_data.shape[0]}")

    # Load TF names if provided
    tf_names = 'all'
    if tf_file:
        print(f"Loading transcription factors from {tf_file}...")
        tf_names = load_tf_names(tf_file)
        print(f"Number of TFs: {len(tf_names)}")

    # Run GRN inference
    print(f"Running GRNBoost2 with seed={seed}...")
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=seed,
        limit=limit,
        verbose=True
    )

    # Save results
    print(f"Saving network to {output_file}...")
    network.to_csv(output_file, sep='\t', index=False, header=False)

    print(f"Done! Network contains {len(network)} regulatory links.")
    print("\nTop 10 regulatory links:")
    print(network.head(10).to_string(index=False))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Infer gene regulatory network using GRNBoost2'
    )
    parser.add_argument(
        'expression_file',
        help='Path to expression matrix (TSV format, genes as columns)'
    )
    parser.add_argument(
        'output_file',
        help='Path for output network (TSV format)'
    )
    parser.add_argument(
        '--tf-file',
        help='Path to transcription factors file (one per line)',
        default=None
    )
    parser.add_argument(
        '--seed',
        help='Random seed for reproducibility (default: 777)',
        type=int,
        default=777
    )
    parser.add_argument(
        '--limit',
        help='Return only the top N regulatory links',
        type=int,
        default=None
    )

    args = parser.parse_args()

    run_grn_inference(
        expression_file=args.expression_file,
        output_file=args.output_file,
        tf_file=args.tf_file,
        seed=args.seed,
        limit=args.limit
    )

How it compares

Pick Arboreto when you need ranked transcriptional regulators from expression matrices inside an agent workflow rather than general-purpose ML feature selection.

FAQ

What algorithms does Arboreto support for GRN inference?

Arboreto documents two multiple-regression GRN inference algorithms: GRNBoost2 (gradient boosting, recommended for large datasets) and GENIE3. Both train a per-target-gene model, rank important features as candidate regulators, and output importance scores.

When should developers choose GRNBoost2 over GENIE3?

Arboreto recommends GRNBoost2 for large-scale expression datasets where faster gradient-boosting inference matters. GENIE3 follows the same three-step regression strategy but differs in underlying regression method and computational efficiency on big matrices.

Is Arboreto safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Tags: Data Science & ML, analytics, pipelines
