
Explore Run
Run authorized exploratory ML variants in isolation, favoring small subsets and short cycles, then surface TOP_RUNS for human review rather than auto-promoting them as baselines.
Overview
Explore-run is an agent skill for the Validate phase that plans and records isolated exploratory experiment runs with conservative budgets and human-review summaries.
Install
npx skills add https://github.com/lllllllama/rigorpilot-skills --skill explore-run
What is this skill?
- Runs only when exploratory execution is explicitly authorized
- Isolates runs from the trusted baseline; includes a budget-aware variant matrix generator script
- Prefers small-subset or short-cycle checks before heavier exploratory jobs
- Writes CHANGESET.md, TOP_RUNS.md, and status.json into explore_outputs
- Avoids rewriting training logic or auto-promoting exploratory results to the trusted lane
- 3 default selection weights (cost 0.25, success_rate 0.35, expected_gain 0.40)
- 3 required explore artifacts: CHANGESET.md, TOP_RUNS.md, status.json
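To make the default selection weights concrete, here is a minimal sketch of how overrides could be merged with the defaults and rescaled to sum to 1. It is a simplified take on the `normalize_weights` helper in the skill's script. The `weighted_score` function is an assumption for illustration: the source does not show how the three criteria are combined, so a plain weighted sum over scores in [0, 1] (higher is better) is assumed here.

```python
from typing import Any, Dict

DEFAULT_SELECTION_WEIGHTS = {"cost": 0.25, "success_rate": 0.35, "expected_gain": 0.40}


def normalize_weights(overrides: Dict[str, Any]) -> Dict[str, float]:
    """Merge overrides into the defaults and rescale so the weights sum to 1."""
    raw = dict(DEFAULT_SELECTION_WEIGHTS)
    raw.update(overrides)
    # Negative weights are treated as zero before rescaling.
    clipped = {key: max(0.0, float(raw[key])) for key in DEFAULT_SELECTION_WEIGHTS}
    total = sum(clipped.values())
    if total <= 0:
        # Nothing usable was supplied, so fall back to the defaults.
        return dict(DEFAULT_SELECTION_WEIGHTS)
    return {key: value / total for key, value in clipped.items()}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hypothetical combination rule: weighted sum of per-criterion scores."""
    return sum(weights[key] * scores.get(key, 0.0) for key in weights)


weights = normalize_weights({"cost": 1.0})  # emphasize cheap variants
candidate = {"cost": 1.0, "success_rate": 0.0, "expected_gain": 0.5}
print(weights, weighted_score(candidate, weights))
```

Rescaling after the merge keeps scores comparable across specs even when only one weight is overridden.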
Adoption & trust: 32.3k installs on skills.sh; 412 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want to try hyperparameter or config variants but risk corrupting your trusted reproduction baseline or burning GPU budget blindly.
Who is it for?
Builders who already have an audited minimal run and explicit authorization to explore improvements on a research codebase.
Skip if: you are attempting a first-time reproduction, making implicit A/B tweaks without authorization, or trying to replace full training orchestration inside the agent.
When should I use this skill?
Plan isolated exploratory runs conservatively, prefer small-subset or short-cycle checks first, and write CHANGESET.md, TOP_RUNS.md, and status.json into explore_outputs.
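As a rough illustration of conservative planning, the sketch below enumerates a small variant matrix and stops at a fixed budget. The function name `build_variant_matrix`, the `max_variants` parameter, and the axis names `lr` and `subset_fraction` are hypothetical. The skill's own generator ranks candidates with its selection weights, while this sketch simply truncates in enumeration order.

```python
import itertools
from typing import Any, Dict, List, Sequence


def build_variant_matrix(
    axes: Dict[str, Sequence[Any]], max_variants: int
) -> List[Dict[str, Any]]:
    """Enumerate axis combinations in order and stop once the budget is reached."""
    names = list(axes)
    combos = itertools.product(*(axes[name] for name in names))
    # islice caps the matrix without materializing the full Cartesian product.
    return [dict(zip(names, combo)) for combo in itertools.islice(combos, max_variants)]


# Hypothetical axes: small data subsets keep each exploratory run cheap.
axes = {"lr": [1e-4, 3e-4, 1e-3], "subset_fraction": [0.05, 0.25]}
variants = build_variant_matrix(axes, max_variants=4)
for variant in variants:
    print(variant)
```

Here the full matrix has six combinations, and the budget keeps only the first four.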
What do I get? / Deliverables
You get isolated explore_outputs artifacts (CHANGESET.md, TOP_RUNS.md, status.json) listing candidates for review without auto-promoting them to the trusted lane.
- explore_outputs/CHANGESET.md
- explore_outputs/TOP_RUNS.md
- explore_outputs/status.json with current_research and variant metadata
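The three deliverables could be written along the lines of the sketch below. The source names only `current_research`, the experiment branch, the variant count, and top runs as recorded values, so the exact field names (`experiment_branch`, `variant_count`, `top_runs`, `id`, `score`), the Markdown layout, and the sample values are all assumptions rather than the skill's actual schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_explore_outputs(
    out_dir: Path, status: Dict[str, Any], changes: List[str]
) -> None:
    """Write CHANGESET.md, TOP_RUNS.md, and status.json into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "status.json").write_text(json.dumps(status, indent=2), encoding="utf-8")

    changeset = ["# CHANGESET", ""] + [f"- {change}" for change in changes]
    (out_dir / "CHANGESET.md").write_text("\n".join(changeset) + "\n", encoding="utf-8")

    # Candidates are listed for human review only; nothing is promoted here.
    top_runs = ["# TOP_RUNS", "", "Candidates for human review (not promoted):", ""]
    for run in status.get("top_runs", []):
        top_runs.append(f"- {run['id']}: score={run['score']:.3f}")
    (out_dir / "TOP_RUNS.md").write_text("\n".join(top_runs) + "\n", encoding="utf-8")


# Hypothetical status payload; field names are illustrative only.
status = {
    "current_research": "unknown",
    "experiment_branch": "explore/example",
    "variant_count": 2,
    "top_runs": [{"id": "variant-001", "score": 0.72}, {"id": "variant-002", "score": 0.61}],
}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "explore_outputs"
    write_explore_outputs(out, status, ["lr: 1e-4 -> 3e-4 (subset run)"])
    written = sorted(path.name for path in out.iterdir())
    loaded = json.loads((out / "status.json").read_text(encoding="utf-8"))
```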
Journey fit
Exploration is explicitly scoped to validating hypotheses on a reproduction fork before you ship trusted metrics. Prototype covers isolated experiment branches and budget-aware variant matrices away from the trusted baseline lane.
How it compares
Use instead of ad-hoc multi-run sweeps that overwrite baseline checkpoints or skip written TOP_RUNS evidence.
Common Questions / FAQ
Who is explore-run for?
Solo ML researchers and agent operators using RigorPilot who need governed exploratory lanes after a trusted sanity path exists.
When should I use explore-run?
During Validate-phase prototype work, once leadership or your own gate explicitly authorizes exploration; not during initial README reproduction or routine ship-only smoke tests.
Is explore-run safe to install?
Check this page's Security Audits panel. Exploratory runs can invoke shell, network, and heavy compute, so approve budgets and isolation rules before running.
Workflow Chain
Requires first: minimal run and audit
SKILL.md - Explore Run
display_name: Rigor Improve / Rigor Explore
short_description: Rigor Improve / Rigor Explore run leaf mode for isolated exploratory experiment runs.
default_prompt: Plan isolated exploratory runs conservatively, prefer small-subset or short-cycle checks first, and write CHANGESET.md TOP_RUNS.md and status.json into explore_outputs.

# Rigor Improve / Rigor Explore Run Policy

## Purpose

Use this skill only when exploratory execution has been explicitly authorized.

## Requirements

- keep experiment runs isolated from the trusted baseline
- prefer small-subset or short-cycle checks before heavier exploratory runs
- record `current_research`, experiment branch, variant count, and top runs
- summarize candidates for human review instead of claiming trusted success

## Avoid

- default or implicit exploration
- rewriting training logic inside this skill
- promoting exploratory results into the trusted lane automatically
- using this skill as the end-to-end `current_research` explore orchestrator

#!/usr/bin/env python3
"""Generate a budget-aware exploratory variant matrix for isolated runs."""

from __future__ import annotations

import argparse
import itertools
import json
from pathlib import Path
from typing import Any, Dict, List, Sequence


DEFAULT_SELECTION_WEIGHTS = {
    "cost": 0.25,
    "success_rate": 0.35,
    "expected_gain": 0.40,
}


def load_spec(path: Path) -> Dict[str, Any]:
    return json.loads(path.read_text(encoding="utf-8-sig"))


def current_research_value(spec: Dict[str, Any]) -> str:
    return str(spec.get("current_research") or spec.get("baseline_ref") or "unknown")


def normalize_metric_goal(value: Any) -> str:
    text = str(value or "maximize").strip().lower()
    if text in {"min", "minimize", "lower", "lower_is_better"}:
        return "minimize"
    return "maximize"


def safe_float(value: Any) -> float:
    if value is None:
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    try:
        return float(str(value))
    except ValueError:
        return 0.0


def clamp_score(value: float) -> float:
    return max(0.0, min(1.0, value))


def can_float(value: Any) -> bool:
    try:
        float(str(value))
        return True
    except (TypeError, ValueError):
        return False


def unique_preserving_order(values: Sequence[Any]) -> List[Any]:
    ordered: List[Any] = []
    for item in values:
        if item not in ordered:
            ordered.append(item)
    return ordered


def rank_lookup(values: Sequence[Any]) -> Dict[Any, int]:
    ordered_values = unique_preserving_order(values)
    if not ordered_values:
        return {}
    has_none = any(item is None for item in ordered_values)
    non_none = [item for item in ordered_values if item is not None]
    if non_none and all(can_float(item) for item in non_none):
        ordered_non_none = sorted(non_none, key=lambda item: (safe_float(item), str(item)))
    else:
        ordered_non_none = non_none
    ordered = ([None] if has_none else []) + ordered_non_none
    return {item: index for index, item in enumerate(ordered)}


def normalized_lookup_score(value: Any, lookup: Dict[Any, int]) -> float:
    if not lookup:
        return 0.0
    index = lookup.get(value, 0)
    max_index = max(lookup.values(), default=0)
    if max_index <= 0:
        return 0.0
    return index / max_index


def normalize_weights(spec: Dict[str, Any]) -> Dict[str, float]:
    raw = dict(DEFAULT_SELECTION_WEIGHTS)
    raw.update(spec.get("selection_weights", {}))
    total = sum(max(0.0, safe_float(value)) for value in raw.values())
    if total <= 0:
        return dict(DEFAULT_SELECTION_WEIGHTS)
    return {
        "cost": max(0.0, safe_float(raw.get("cost"))) / total,
        "success_rate": max(0.0, safe_float(raw.get("success_rate"))) / total,
        "expected_gain": max(0.0, safe_float(raw.get("expected_gain"))) / total,
    }


def axis_aggressiveness_score(axis_values: Dict[str, Any], axes: Dict[str, Sequence[A