
Arboreto
Infer gene regulatory networks from expression matrices using GRNBoost2 or GENIE3 when you are doing computational biology or single-cell analysis from the terminal with an agent.
Overview
Arboreto is an agent skill for the Idea phase that documents GRNBoost2 and GENIE3 GRN inference workflows for expression data and TF lists.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill arboreto
What is this skill?
- Documents two GRN algorithms—GRNBoost2 (gradient boosting, default for large data) and GENIE3 (random forests)—with the
- Recommends GRNBoost2 for large-scale and time-constrained work (e.g. single-cell RNA-seq with tens of thousands of obser
- Shared three-step strategy: per-target regression, important-feature regulators, scored TF–target–importance triplets
- Includes `grnboost2()` signature and parameters (`expression_data`, `tf_names`, `seed`, `limit`, sparse matrix support)
- Explains when to pick GENIE3 vs GRNBoost2 based on dataset size and runtime constraints
- Two high-level GRN algorithms: GRNBoost2 and GENIE3
- Three-step inference strategy shared by both algorithms
- GRNBoost2 targets datasets with tens of thousands of observations
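To make the `expression_data` and `tf_names` inputs concrete, here is a minimal sketch of checking them before a run. The helper name `prepare_grn_inputs` is hypothetical and not part of Arboreto, and the layout (rows are observations, columns are genes) is an assumption that should be verified against the Arboreto documentation for your input type.

```python
import numpy as np


def prepare_grn_inputs(matrix, gene_names, candidate_tfs):
    """Validate a dense expression matrix and keep only TFs present in it.

    Hypothetical helper, not part of Arboreto. Assumes rows are
    observations (cells or samples) and columns are genes.
    """
    matrix = np.asarray(matrix, dtype=float)
    if matrix.ndim != 2:
        raise ValueError("expression matrix must be 2-D")
    if matrix.shape[1] != len(gene_names):
        raise ValueError("need exactly one gene name per column")
    if len(set(gene_names)) != len(gene_names):
        raise ValueError("gene names must be unique")
    known = set(gene_names)
    # Preserve the caller's TF order while dropping names absent from the matrix.
    tf_names = [tf for tf in candidate_tfs if tf in known]
    if not tf_names:
        raise ValueError("no candidate TF is present in the matrix")
    return matrix, list(gene_names), tf_names


matrix, gene_names, tf_names = prepare_grn_inputs(
    matrix=[[1.0, 0.0, 3.0], [2.0, 1.0, 0.5]],  # 2 observations x 3 genes
    gene_names=["GeneA", "GeneB", "GeneC"],
    candidate_tfs=["GeneC", "GeneX", "GeneA"],   # GeneX is not measured
)
print(tf_names)  # ['GeneC', 'GeneA']

# The validated values would then be passed on, for example:
# network = grnboost2(expression_data=matrix, gene_names=gene_names,
#                     tf_names=tf_names, seed=42)
```

Filtering the TF list up front makes dropped names visible before a long inference run starts.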
Adoption & trust: 514 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have bulk or single-cell expression data and transcription-factor names but no clear, library-correct way to infer regulator–target networks at scale.
Who is it for?
Solo builders or tiny teams running Python GRN inference on RNA-seq or single-cell matrices who want GRNBoost2 as the default for large datasets.
Skip if: you have no expression matrices, work outside gene expression, or need a one-click SaaS dashboard instead of Python library workflows.
When should I use this skill?
User needs GRN inference with Arboreto, GRNBoost2/GENIE3 selection, or Python examples for expression data and TF regulators.
What do I get? / Deliverables
You can call `grnboost2` or GENIE3 with the right data types and parameters and interpret TF–target–importance triplets for downstream validation or visualization.
- Correct `grnboost2` or GENIE3 invocation for the dataset scale
- Algorithm recommendation (GRNBoost2 vs GENIE3) with rationale
- Expected TF–target–importance style network output description
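As an illustration of reading the TF–target–importance output before validation or visualization, the sketch below groups links by target and keeps the highest-scoring regulators. It assumes the network has already been converted to plain `(tf, target, importance)` tuples; the function name `top_regulators` and the example values are hypothetical and are not real inference output.

```python
from collections import defaultdict


def top_regulators(links, k=2):
    """Return the k highest-importance TFs for each target gene."""
    by_target = defaultdict(list)
    for tf, target, importance in links:
        by_target[target].append((tf, importance))
    return {
        target: sorted(regs, key=lambda r: r[1], reverse=True)[:k]
        for target, regs in by_target.items()
    }


# Illustrative values only, not real inference output.
links = [
    ("TF1", "GeneA", 12.5),
    ("TF2", "GeneA", 3.1),
    ("TF3", "GeneA", 7.8),
    ("TF1", "GeneB", 0.9),
]
result = top_regulators(links, k=2)
print(result["GeneA"])  # [('TF1', 12.5), ('TF3', 7.8)]
```

The importance column ranks candidate links relative to each other, so a cutoff such as `k` is an analysis choice rather than a statistical threshold.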
Journey fit
GRN inference is an upstream research step—turning expression data into regulator hypotheses before you build pipelines, validate models, or publish results. Fits the research subphase because the skill documents algorithm choice, parameters, and Python usage for exploratory regulatory-network analysis rather than shipping product code.
How it compares
Use for documented Arboreto API usage and algorithm choice—not as a generic statistics tutor or a wet-lab protocol skill.
Common Questions / FAQ
Who is arboreto for?
Computational biology solo builders and indie researchers who use Python agents to run GRN inference with the Arboreto library on real expression data.
When should I use arboreto?
During Idea/research when you are comparing GRNBoost2 vs GENIE3, wiring `grnboost2()` parameters, or explaining regulator scoring to an agent before validation or pipeline work.
Is arboreto safe to install?
Treat it like any third-party scientific skill: review the Security Audits panel on this Prism page and verify the skill source and dependencies before running on sensitive data.
SKILL.md
# GRN Inference Algorithms

Arboreto provides two high-level algorithms for gene regulatory network (GRN) inference, both based on the multiple regression approach.

## Algorithm Overview

Both algorithms follow the same inference strategy:

1. For each target gene in the dataset, train a regression model
2. Identify the most important features (potential regulators) from the model
3. Emit these features as candidate regulators with importance scores

The key difference is **computational efficiency** and the underlying regression method.

## GRNBoost2 (Recommended)

**Purpose**: Fast GRN inference for large-scale datasets using gradient boosting.

### When to Use

- **Large datasets**: Tens of thousands of observations (e.g., single-cell RNA-seq)
- **Time-constrained analysis**: Need faster results than GENIE3
- **Default choice**: GRNBoost2 is the flagship algorithm and recommended for most use cases

### Technical Details

- **Method**: Stochastic gradient boosting with early-stopping regularization
- **Performance**: Significantly faster than GENIE3 on large datasets
- **Output**: Same format as GENIE3 (TF-target-importance triplets)

### Usage

```python
from arboreto.algo import grnboost2

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,
    limit=5000,
)
```

### Parameters (`grnboost2`)

```python
grnboost2(
    expression_data,              # DataFrame, ndarray, or scipy.sparse.csc_matrix
    gene_names=None,              # Required for ndarray/sparse inputs
    tf_names='all',               # TF list, None/'all' → all genes as regulators
    client_or_address='local',    # 'local', scheduler address, or Dask Client
    early_stop_window_length=25,  # Early-stopping window (GRNBoost2 only)
    limit=None,                   # Return top N links globally
    seed=None,                    # Random seed; None = non-deterministic
    verbose=False,
)
```

## GENIE3

**Purpose**: Classic Random Forest-based GRN inference, serving as the conceptual blueprint.
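Since GENIE3 is described as the blueprint for the shared strategy, the three steps from the Algorithm Overview can be sketched in a few lines. This is not Arboreto's implementation: as a stand-in for the tree ensembles, it uses ordinary least squares on standardized data and treats absolute coefficients as importance scores. The function name `infer_links` and the synthetic data are hypothetical.

```python
import numpy as np


def infer_links(expr, gene_names, tf_names):
    """Per-target regression sketch returning (tf, target, importance) triplets.

    Stand-in regressor: ordinary least squares on standardized data, with
    absolute coefficients as importance. Arboreto uses tree ensembles instead.
    """
    expr = np.asarray(expr, dtype=float)
    # Standardize each gene so coefficient magnitudes are comparable.
    # Assumes no gene is constant (a zero standard deviation would divide by zero).
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    index = {g: i for i, g in enumerate(gene_names)}
    links = []
    for target in gene_names:
        # Step 1: regress the target on every candidate TF except itself.
        regulators = [tf for tf in tf_names if tf != target]
        if not regulators:
            continue
        X = z[:, [index[tf] for tf in regulators]]
        y = z[:, index[target]]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Steps 2 and 3: score each regulator and emit a triplet.
        for tf, c in zip(regulators, coef):
            links.append((tf, target, float(abs(c))))
    # Strongest links first.
    return sorted(links, key=lambda link: link[2], reverse=True)


rng = np.random.default_rng(0)
tf_a = rng.normal(size=200)
tf_b = rng.normal(size=200)
# Synthetic data: GeneC is driven by TF_A, with a little noise.
gene_c = 2.0 * tf_a + 0.1 * rng.normal(size=200)
expr = np.column_stack([tf_a, tf_b, gene_c])

links = infer_links(expr, ["TF_A", "TF_B", "GeneC"], ["TF_A", "TF_B"])
print(links[0][:2])  # ('TF_A', 'GeneC')
```

A linear model keeps the sketch short, but it only captures linear dependence; the tree-based regressors described in this document can also pick up non-linear relationships.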
### When to Use

- **Smaller datasets**: When dataset size allows for longer computation
- **Comparison studies**: When comparing with published GENIE3 results
- **Validation**: To validate GRNBoost2 results

### Technical Details

- **Method**: Random Forest regression (ExtraTrees available via `diy`)
- **Foundation**: Original multiple regression GRN inference strategy
- **Trade-off**: More computationally expensive but well-established

### Usage

```python
from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,
)
```

### Parameters (`genie3`)

```python
genie3(
    expression_data,
    gene_names=None,
    tf_names='all',
    client_or_address='local',
    limit=None,
    seed=None,
    verbose=False,
)
```

## Algorithm Comparison

| Feature | GRNBoost2 | GENIE3 |
|---------|-----------|--------|
| **Speed** | Fast (optimized for large data) | Slower |
| **Method** | Gradient boosting (GBM) | Random Forest |
| **Best for** | Large-scale data (10k+ observations) | Small-medium datasets |
| **Output format** | Same | Same |
| **Inference strategy** | Multiple regression | Multiple regression |
| **Recommended** | Yes (default choice) | For comparison/validation |
| **Early stopping** | Yes (`early_stop_window_length`) | No |

## Advanced: Custom Regressors with `diy`

For custom scikit-learn regressor settings, use `diy()` (not `grnboost2`/`genie3` kwargs):

```python
from arboreto.algo import diy
from arboreto.core import SGBM_KWARGS, RF_KWARGS

# Custom GRNBoost2-style run
custom_gbm = diy(
    expression_data=expression_matrix,
    regressor_type='GBM',  # 'RF', 'GBM', or 'ET'
    regressor_kwargs={
        **SGBM_KWARGS,
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1,
    },
    tf_names=tf_names,
    seed=42,
)

# Custom GENIE3-style run
custom_rf = diy(
    expression_data=expression_matrix,
    regressor_ty