Codex Autoresearch Loop

Name: Codex Autoresearch Loop
Author: aradotso

aradotso/trending-skills

1.2k installs
66 repo stars
Updated July 9, 2026

Codex Autoresearch is a Codex skill that runs an autonomous modify→verify→keep/revert loop on your codebase until a measurable goal is reached.

About

Codex Autoresearch is an agent skill that runs an autonomous modify→verify→keep/revert loop on codebases until a measurable goal is reached. Developers describe a single-sentence target; Codex infers scope, metrics, and verification commands, then iterates unattended with automatic git commits and reverts. Seven modes span loop optimization, planning, debugging, security audits, and release gating. Cross-run learning stores lessons; pivot protocol escalates via refinement and web search when stalled.

Seven inference modes (loop, plan, fix, debug, security, ship, exec) activate from natural language
Dual-gate verification: separate verify (metric improvement) and guard (regression detection) commands
Automatic escalation: 3+ discards → REFINE, 5+ → PIVOT, 2 PIVOTs → web search without user prompts
Autonomous code improvement loops for TypeScript types, test coverage, bundle size, and lint warnings

Codex Autoresearch Loop by the numbers

1,176 all-time installs (skills.sh)
+8 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #910 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: CRITICAL risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

codex-autoresearch-loop capabilities & compatibility

Capabilities: autonomous iteration · metric optimization · hypothesis testing · git automation · escalation protocol · cross session learning
Use cases: code review · testing · debugging · refactoring
Runs: locally
Pricing: Free

From the docs

What codex-autoresearch-loop says it does

Self-directed iterative research skill for Codex that continuously cycles through modify, verify, retain or discard, and repeat until a measurable goal is reached.

skill:aradotso/trending-skills#codex-autoresearch-loop

Codex maps your sentence to one of seven modes automatically — you never pick a mode explicitly.

skill:aradotso/trending-skills#codex-autoresearch-loop

You are never asked for permission during escalation. The loop continues.

skill:aradotso/trending-skills#codex-autoresearch-loop

npx skills add https://github.com/aradotso/trending-skills --skill codex-autoresearch-loop

Installs	1.2k
repo stars	★ 66
Security audit	0 / 3 scanners passed
Last updated	July 9, 2026
Repository	aradotso/trending-skills ↗

What it does

Autonomous code improvement loops for TypeScript types, test coverage, bundle size, and lint warnings

Who is it for?

Unattended metric optimization (coverage, bundle size, type errors, lint warnings) and evidence-driven debugging across long sessions.

Skip if: Manual code review, interactive pair programming, one-shot ad-hoc queries without measurable targets.

When should I use this skill?

You have a quantifiable code goal (e.g. 'eliminate all any types', 'reach 85% coverage'), enough context to write a verify command, and permission for unattended git commits.

What you get

A measurable improvement (lower type errors, higher coverage, smaller bundle) stacked in git history with automatic rollbacks on failure.

Retained code patches
Verification logs
Goal completion report

By the numbers

Seven inference modes: loop, plan, fix, debug, security, ship, exec
Up to five confirmation rounds before loop auto-proceeds
Parallel mode supports concurrent hypothesis testing (e.g. Parallel: 4)

Files

SKILL.md (Markdown)

Codex Autoresearch

Skill by ara.so — Daily 2026 Skills collection.

Codex Autoresearch is a Codex skill that runs an autonomous modify→verify→keep/revert loop on your codebase. You describe a measurable goal in one sentence; Codex confirms the plan, then iterates unattended — every improvement stacks in git, every failure reverts automatically — until interrupted or a cap is reached. Inspired by Karpathy's autoresearch concept, generalized beyond ML training to any software metric.

---

Installation

Option A — manual copy into your project:

git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch

Option B — Codex skill installer:

$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch

The skill lives at .agents/skills/codex-autoresearch/ inside your project. No config file is required before first use.

---

How to Activate

Open Codex in your project directory and prefix your goal with $codex-autoresearch:

$codex-autoresearch
I want to get rid of all `any` types in my TypeScript code

Codex will:

1. Scan the repo and infer scope, metric, verify command, and guard command.
2. Present a confirmation summary; reply go (or correct anything).
3. Run the loop unattended until you interrupt it or the goal is met.

You never write config. Codex infers everything.

---

Confirmation Flow

Before the loop starts Codex always shows what it found and asks you to confirm. Example exchange:

Codex: I found 47 `any` occurrences across src/**/*.ts.

       Confirmed:
       - Target: eliminate `any` types in src/**/*.ts
       - Metric: `any` count (current: 47), direction: lower
       - Verify: grep + tsc --noEmit as guard

       Need to confirm:
       - Run until all gone, or cap at N iterations?

       Reply "go" to start, or tell me what to change.

You:   Go, run overnight.

Codex: Starting — baseline: 47. Iterating until interrupted.

Up to five confirmation rounds are possible. After that, Codex proceeds.

---

The Loop (internals)

PHASE 0: Probe environment (CPU/GPU/RAM/toolchains), check for session resume
PHASE 1: Read context + lessons file from prior run (if any)

LOOP (forever or N times):
  1. Review current state, git history, results log, lessons
  2. Pick ONE hypothesis (apply perspectives, filter by environment)
     -- or N hypotheses if parallel mode is active
  3. Make ONE atomic change
  4. git commit (before verification)
  5. Run verify command  →  did the target metric improve?
     Run guard command   →  did anything else break?
  6. Improved → keep (extract lesson)
     Worse    → approved rollback strategy (git revert)
     Crashed  → fix or skip
  7. Log the result to results log
  8. Health check (disk, git, verify health)
  9. If 3+ discards → REFINE; 5+ → PIVOT; 2 PIVOTs → web search
 10. Repeat. Never stop. Never ask.

The loop runs unbounded unless you say Iterations: N during confirmation.
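
The control flow above can be sketched in a few lines of Python. This is a minimal illustration and not the skill's actual implementation: `run_loop` and all of its callables (`measure`, `apply_change`, `commit`, `revert`, `guard`) are hypothetical names, and the sketch leaves out lessons, health checks, and escalation.

```python
from typing import Callable, List, Optional, Tuple


def run_loop(
    measure: Callable[[], float],
    apply_change: Callable[[], None],
    commit: Callable[[], None],
    revert: Callable[[], None],
    guard: Callable[[], bool] = lambda: True,
    direction: str = "lower",
    iterations: Optional[int] = None,
    target: Optional[float] = None,
) -> Tuple[float, List[Tuple[str, float]]]:
    """Hypothetical modify -> verify -> keep/revert loop (names are assumptions)."""

    def better(new: float, old: float) -> bool:
        return new < old if direction == "lower" else new > old

    def reached(value: float) -> bool:
        if target is None:
            return False
        return value <= target if direction == "lower" else value >= target

    best = measure()  # baseline before the first change
    log: List[Tuple[str, float]] = []
    done = 0
    while (iterations is None or done < iterations) and not reached(best):
        apply_change()   # one atomic change
        commit()         # commit before verification
        value = measure()
        if better(value, best) and guard():
            best = value
            log.append(("kept", value))
        else:
            revert()     # undo the commit that did not help
            log.append(("reverted", value))
        done += 1
    return best, log


# Simulated run: the "codebase" is just a number that each change nudges.
state = {"metric": 47.0, "previous": 47.0}
deltas = iter([-2.0, 3.0, -5.0])


def fake_change() -> None:
    state["previous"] = state["metric"]
    state["metric"] += next(deltas)


def fake_revert() -> None:
    state["metric"] = state["previous"]


best, log = run_loop(lambda: state["metric"], fake_change, lambda: None,
                     fake_revert, iterations=3)
print(best, log)
```

In the simulated run the second change makes the metric worse and is reverted, so the third change starts again from the last kept value.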

---

Dual-Gate Verification

Two commands serve distinct purposes:

Gate	Purpose	Failure means
Verify	Did the target metric improve?	Change discarded, reverted
Guard	Did anything else break?	Change reworked (up to 2 attempts), then reverted

Guard files are never modified by the loop.
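
One way to read the table is as a small decision function. The sketch below is an assumption about how the two gates could combine, not the skill's code: `dual_gate`, `guard_passes`, and `rework` are hypothetical names, and for simplicity the metric is not re-measured after a rework.

```python
from typing import Callable


def dual_gate(
    baseline: float,
    value: float,
    direction: str,
    guard_passes: Callable[[], bool],
    rework: Callable[[], None],
    max_reworks: int = 2,
) -> str:
    """Return "keep" or "revert" for one change (illustrative only)."""
    improved = value < baseline if direction == "lower" else value > baseline
    if not improved:
        return "revert"          # verify gate failed: discard immediately
    if guard_passes():
        return "keep"            # both gates passed
    for _ in range(max_reworks):
        rework()                 # try to repair the regression
        if guard_passes():
            return "keep"
    return "revert"              # guard still failing after the rework budget


# Guard fails once, then passes after a single rework.
answers = iter([False, True])
print(dual_gate(47, 45, "lower", lambda: next(answers), lambda: None))
```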

Example verify + guard pair for a Python coverage run:

Verify: pytest --cov=src --cov-report=term 2>&1 | grep TOTAL | awk '{print $NF}' | tr -d '%'
Guard:  python -m mypy src --ignore-missing-imports

Example for TypeScript type cleanup:

Verify: grep -rw "any" src --include="*.ts" | wc -l
Guard:  npx tsc --noEmit
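
A verify command is only useful if its output can be read as a single number. The helper below is a hypothetical sketch (`read_metric` is not part of the skill) of how a wrapper might run a verify command and parse the result, tolerating a trailing percent sign. It assumes a POSIX shell is available.

```python
import subprocess
import sys


def read_metric(command: str) -> float:
    """Run a shell command and parse its stdout as one number.

    Raises ValueError when the output is not a single numeric value.
    """
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    text = result.stdout.strip().rstrip("%")
    return float(text)


# Stand-in for a real verify command: a Python one-liner that prints 47.
print(read_metric(f'"{sys.executable}" -c "print(47)"'))
```

Running a candidate verify command through a check like this before starting the loop catches non-numeric output early.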

---

Modes

Codex maps your sentence to one of seven modes automatically, so you normally do not need to pick one. To force a mode, name it after the command, as in the Quick Reference.

`loop` — iterate toward a measurable target (default)

$codex-autoresearch
Improve test coverage in src/ to at least 80%

$codex-autoresearch
Reduce bundle size — it's currently 2.3 MB, get it under 1 MB

`plan` — turn a vague goal into a validated loop config

$codex-autoresearch
I want to make our API faster but I don't know where to start

Codex will interview you (p95 latency vs throughput? which endpoint?) and produce a ready-to-run loop config.

`fix` — repair errors until count reaches zero

$codex-autoresearch
pytest is failing, 12 tests broken after the refactor — fix them all

`debug` — evidence-driven root-cause hunting

$codex-autoresearch
Our API returns 503 randomly under load, no idea why

Each iteration tests one falsifiable hypothesis. Codex presents evidence, not guesses.

`security` — read-only STRIDE + OWASP audit

$codex-autoresearch
Is this code secure?

`ship` — readiness verification and release gating

$codex-autoresearch
Ship it

`exec` — one-shot execution with no loop

$codex-autoresearch
Run the benchmark suite and summarize results

---

Inline Configuration (optional)

You can override defaults inline during the confirmation step — no file edits needed:

Phrase	Effect
`Iterations: 20`	Cap the loop at 20 iterations
`Parallel: 3`	Test 3 hypotheses concurrently per round
`Guard: npm test`	Override the inferred guard command
`Verify: <command>`	Override the inferred verify command
`Scope: src/api/`	Restrict changes to a subdirectory

Example during confirmation:

You:   Go. Iterations: 30, Guard: npm test, Scope: src/api/
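
To make the phrase format concrete, here is a rough sketch of how such overrides could be pulled out of a reply. Codex interprets the reply itself, so this parser is purely illustrative: `parse_overrides` is a hypothetical name, and it assumes overrides are separated by commas and that command values contain no `, Key:` sequences.

```python
import re
from typing import Dict

KEYS = "Iterations|Parallel|Guard|Verify|Scope"

# A key, a colon, then everything up to the next ", Key:" or the end of the text.
PATTERN = re.compile(rf"({KEYS}):\s*(.*?)(?=,\s*(?:{KEYS}):|$)")


def parse_overrides(reply: str) -> Dict[str, str]:
    """Extract inline overrides from a confirmation reply (illustrative)."""
    return {key: value.strip() for key, value in PATTERN.findall(reply)}


print(parse_overrides("Go. Iterations: 30, Guard: npm test, Scope: src/api/"))
```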

---

Cross-Run Learning

At the end of each iteration Codex writes a structured lesson to .agents/skills/codex-autoresearch/lessons.md:

Iteration 7 — KEPT
Hypothesis: replace explicit `any` with inferred generic in src/utils/mapper.ts
Change: added <T extends Record<string, unknown>> to mapKeys()
Result: any count 31 → 29
Lesson: Generic constraints on utility functions eliminate clusters of `any` downstream.

On session resume Codex reads this file first. Each new run benefits from prior runs.
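
If you want to inspect lessons from your own tooling, an entry in the layout shown above can be parsed with a few lines of Python. This is a sketch under the assumption that every entry follows that exact layout (a header line, then `Key: value` lines). `parse_lesson` is a hypothetical helper, not something the skill ships.

```python
import re
from typing import Dict, Union


def parse_lesson(block: str) -> Dict[str, Union[int, str]]:
    """Parse one lesson entry into a dict (assumes the layout shown above)."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    # Header: "Iteration <n> <separator> <STATUS>"
    header = re.match(r"Iteration (\d+)\s+\S+\s+(\w+)", lines[0])
    if header is None:
        raise ValueError("unrecognised lesson header: " + lines[0])
    lesson: Dict[str, Union[int, str]] = {
        "iteration": int(header.group(1)),
        "status": header.group(2),
    }
    for line in lines[1:]:
        key, _, value = line.partition(":")
        lesson[key.strip().lower()] = value.strip()
    return lesson


entry = "Iteration 7 \u2014 KEPT\nResult: any count 31 \u2192 29"
print(parse_lesson(entry))
```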

To resume an interrupted run:

$codex-autoresearch
Resume

Codex re-reads the lessons file, checks git state, re-establishes the baseline, and continues.

---

Parallel Experiments

Request parallel mode during confirmation or at any time:

You:   Go, parallel 4

Codex runs four hypotheses concurrently, keeps the best result, discards the rest. Useful when hypothesis space is large.
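
The "keep the best, discard the rest" step amounts to a comparison against the baseline. A minimal sketch, assuming each hypothesis yields one metric value (`pick_best` and the hypothesis labels are hypothetical):

```python
from typing import Dict, Optional, Tuple


def pick_best(
    baseline: float, results: Dict[str, float], direction: str = "lower"
) -> Tuple[Optional[str], float]:
    """Return the winning hypothesis and its metric, or (None, baseline)
    when no hypothesis beats the baseline."""
    best_name: Optional[str] = None
    best_value = baseline
    for name, value in results.items():
        improved = value < best_value if direction == "lower" else value > best_value
        if improved:
            best_name, best_value = name, value
    return best_name, best_value


# Four hypotheses tested in one round against a baseline of 31.
print(pick_best(31, {"h1": 30, "h2": 29, "h3": 33, "h4": 31}))
```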

---

Pivot Protocol

If the loop stalls, escalation happens automatically:

Consecutive discards	Action
3	REFINE — narrow hypothesis, try smaller atomic changes
5	PIVOT — change strategy entirely
2 PIVOTs	Web search — Codex fetches external references to unstick itself

You are never asked for permission during escalation. The loop continues.
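
The thresholds in the table can be modelled as a small state machine. The source does not say whether the discard counter resets after a PIVOT or whether the second PIVOT itself triggers the web search, so both behaviours below are assumptions. The `Escalator` class is hypothetical.

```python
class Escalator:
    """Tracks consecutive discards and returns the next escalation step."""

    def __init__(self) -> None:
        self.discards = 0  # consecutive discarded changes
        self.pivots = 0    # PIVOTs so far in this run

    def record(self, kept: bool) -> str:
        if kept:
            self.discards = 0
            return "CONTINUE"
        self.discards += 1
        if self.discards >= 5:
            self.discards = 0        # assumption: a pivot resets the counter
            self.pivots += 1
            # assumption: the second pivot escalates to web search
            return "WEB_SEARCH" if self.pivots >= 2 else "PIVOT"
        if self.discards >= 3:
            return "REFINE"
        return "CONTINUE"


escalator = Escalator()
print([escalator.record(kept=False) for _ in range(10)])
```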

---

Real Code Examples

Example 1 — TypeScript `any` elimination (Python verify script)

If you want a custom verify script instead of a one-liner:

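# scripts/count_any.py
"""Print the number of lines under src/ that contain the word `any`."""
import subprocess
import sys

# grep exits 0 on matches, 1 on no matches, and 2 or higher on a real error.
result = subprocess.run(
    ["grep", "-r", "--include=*.ts", r"\bany\b", "src/"],
    capture_output=True, text=True
)
if result.returncode > 1:
    # Do not report 0 when grep itself failed (for example, src/ is missing).
    sys.stderr.write(result.stderr)
    sys.exit(result.returncode)
print(len(result.stdout.splitlines()))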

Tell Codex during confirmation:

Verify: python scripts/count_any.py
Guard:  npx tsc --noEmit

Example 2 — pytest coverage loop (Python)

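# scripts/coverage_pct.py
import re
import subprocess
import sys

# Use run() rather than check_output(): pytest exits non-zero when tests fail,
# and the script should still print a number in that case.
proc = subprocess.run(
    ["pytest", "--cov=src", "--cov-report=term", "-q"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)
# The first percentage on the TOTAL row is the overall coverage, even when
# branch coverage adds extra columns.
match = re.search(r"^TOTAL\s.*?(\d+)%", proc.stdout, re.MULTILINE)
print(int(match.group(1)) if match else 0)
sys.exit(0)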

$codex-autoresearch
Improve test coverage — target 85%

Verify: python scripts/coverage_pct.py
Guard:  python -m mypy src
Direction: higher
Target: 85
Iterations: 50

Example 3 — bundle size loop (Node.js project)

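#!/usr/bin/env bash
# scripts/bundle_size.sh
# Print the bundle size in KB. Exit on build failure so a stale bundle is never measured.
set -euo pipefail
# Silence build output on both streams so the script prints a single number.
npm run build --silent >/dev/null 2>&1
du -k dist/bundle.js | awk '{print $1}'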

$codex-autoresearch
Reduce our JS bundle size, currently ~2300 KB, target under 900 KB

Verify: bash scripts/bundle_size.sh
Guard:  npm test
Direction: lower
Target: 900

Example 4 — lint warning count (any language)

#!/usr/bin/env bash
# scripts/lint_count.sh
# Print the total number of ESLint messages (errors and warnings) under src/.
npx eslint src/ --format json 2>/dev/null \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(sum(len(f['messages']) for f in d))"

$codex-autoresearch
Get our ESLint warning count to zero

Verify: bash scripts/lint_count.sh
Direction: lower
Target: 0

---

Unattended Runs

For overnight or long runs, ensure Codex CLI approval settings do not interrupt git commit or git revert commands. The simplest option is to run in a disposable or sandboxed repo clone:

git clone . /tmp/autoresearch-sandbox
cd /tmp/autoresearch-sandbox
# launch Codex here with full permissions

Results accumulate in git history. Pull the winning commits back to your main repo when done:

# in your main repo (replace main with the branch Codex committed to in the sandbox)
git fetch /tmp/autoresearch-sandbox main
git cherry-pick <winning-commit-sha>

---

Session Artifacts

File	Contents
`.agents/skills/codex-autoresearch/lessons.md`	Structured lessons from every iteration
`.agents/skills/codex-autoresearch/results.log`	Full per-iteration log (metric value, kept/reverted, elapsed)
`.agents/skills/codex-autoresearch/session.json`	Current session state for resume

These files persist across Codex sessions. Delete them to start fresh.
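
Starting fresh can be scripted. The helper below is a small sketch that deletes the three artifacts listed in the table if they exist. `reset_session` is a hypothetical name; the default directory is the skill path given above.

```python
from pathlib import Path
from typing import List

ARTIFACTS = ("lessons.md", "results.log", "session.json")


def reset_session(skill_dir: str = ".agents/skills/codex-autoresearch") -> List[str]:
    """Delete session artifacts and return the names that were removed."""
    removed = []
    for name in ARTIFACTS:
        path = Path(skill_dir) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed


if __name__ == "__main__":
    print(reset_session())
```

Deleting only session.json keeps the accumulated lessons while forcing a fresh baseline.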

---

Troubleshooting

Loop reverts every change:

Verify command may be returning a non-numeric value. Test it manually: bash -c "<your verify command>" should print a single number.
Metric direction may be wrong. Confirm Direction: lower or Direction: higher during setup.

Guard fires on unrelated files:

Narrow scope: Scope: src/specific-module/
Or tell Codex explicitly: Do not touch tests/ during confirmation.

Session resume picks up wrong baseline:

Delete session.json to force a fresh baseline: rm .agents/skills/codex-autoresearch/session.json

Parallel mode produces merge conflicts:

Codex handles this internally via the pivot protocol, but if it gets stuck, reduce parallelism: Parallel: 2

Codex asks questions mid-loop:

This means a guard crash produced ambiguous output. Pre-empt it by specifying Guard: <command> || true if guard failures should be non-fatal, or by giving Codex fuller sandbox permissions so it can run git commands freely.

Loop hits PIVOT but makes no progress:

Supply a seed hypothesis during confirmation: Hint: try tree-shaking unused imports first
Or run plan mode first to produce a richer hypothesis list before switching to loop.

---

Quick Reference

# Start a loop
$codex-autoresearch
<your goal in one sentence>

# Resume interrupted run
$codex-autoresearch
Resume

# Bounded run
$codex-autoresearch
<goal> — Iterations: 25

# Parallel hypotheses
$codex-autoresearch
<goal> — Parallel: 4

# Force a mode
$codex-autoresearch fix
pytest has 8 failures, repair them

# Read-only audit
$codex-autoresearch security
Audit src/api/ for injection vulnerabilities

How it compares

Pick codex-autoresearch-loop over single-shot codegen skills when Codex must autonomously iterate toward a verifiable metric without constant supervision.

FAQ

What happens if Codex makes a bad change?

If the verify command shows no improvement, Codex reverts the change automatically with git revert. The guard command also runs after each change; if it fails, Codex reworks the change up to 2 times before reverting. Failed attempts do not block the loop.

How does Codex know what to verify and guard?

Codex scans your repo, interviews you during a confirmation flow (up to 5 rounds), and infers the verify command (measure target metric) and guard command (detect regressions). You can override both inline: 'Verify: <cmd>', 'Guard: <cmd>'.

Can I resume an interrupted run?

Yes. Run '$codex-autoresearch Resume' and Codex re-reads lessons.md, checks git state, re-establishes the baseline, and continues. Session artifacts (lessons.md, results.log, session.json) persist across Codex sessions.

Is Codex Autoresearch Loop safe to install?

skills.sh reports 0 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
