Evolving Ai Agents

Name: Evolving Ai Agents
Author: orchestra-research

orchestra-research/ai-research-skills

336 installs
11.2k repo stars
Updated June 16, 2026

evolving-ai-agents is an agent skill that documents the A-Evolve `Evolver` API for benchmark-driven agent evolution.

About

evolving-ai-agents is a reference skill for Orchestra’s A-Evolve stack: import `agent_evolve as ae`, construct an `Evolver` with an agent seed or custom workspace, attach a named or custom benchmark, and run evolution cycles to improve benchmark scores. Solo builders shipping autonomous coding agents use it when ad-hoc prompt tweaks stop scaling and they need a structured loop (tasks, scoring, and evolvable layers) grounded in public harnesses such as SWE-bench Verified or MCP-Atlas. The doc spells out the resolution rules for string agent names, working-directory copies, and manifest validation so you do not start evolution on a broken workspace. You can override `workspace_dir`, inject a custom engine, and pass an `EvolveConfig` without re-reading the whole Python package. Complexity is advanced because you are orchestrating benchmarks, seeds, and evolution state rather than making a single API call. Treat this as Build-phase agent infrastructure; pair it with your own eval harness and version control before Ship.

`ae.Evolver` entry point with `run(cycles)` returning `EvolutionResult`
Built-in agent seeds: swe, terminal, mcp—or custom `BaseAgent` / workspace paths
Built-in benchmarks: swe-verified, mcp-atlas, terminal2, skill-bench, arc-agi-3 via `BenchmarkAdapter`
Seed workspaces copied to a working directory with manifest checks for `entrypoint` and `evolvable_layers`
Custom `EvolutionEngine` hook (default `AEvolveEngine`) and `EvolveConfig` tuning

Evolving Ai Agents by the numbers

336 all-time installs (skills.sh)
+37 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #2,131 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill evolving-ai-agents

Security audit	1 / 3 scanners passed

What it does

Run benchmark-driven evolution cycles on coding or terminal agents with the A-Evolve `agent_evolve` API and built-in seeds like swe, terminal, and mcp.

Who is it for?

Best when you are experimenting with self-improving agents, already work in Python, and want named benchmarks plus manifest-validated seed workspaces.

Skip if: you are a beginner who only needs a one-shot Claude skill, without local Python evolution infrastructure or benchmark maintenance.

When should I use this skill?

Use it when you are implementing or debugging an A-Evolve / agent_evolve Evolver setup, including seeds, benchmarks, or evolution cycles.

What you get

You can configure `ae.Evolver` with seeds, benchmarks, and cycles so evolved workspace state reflects measurable benchmark gains instead of guesswork.

Configured Evolver run
EvolutionResult after benchmark cycles
Validated agent workspace with entrypoint and evolvable_layers

By the numbers

5 built-in benchmark names documented
3 built-in agent seed names: swe, terminal, mcp

Files

SKILL.md (Markdown)

Evolving AI Agents with A-Evolve

Overview

A-Evolve is universal infrastructure for evolving any AI agent across any domain using any evolution algorithm with zero manual engineering. It represents all evolvable agent state as files (prompts, skills, memory, tools), runs iterative solve-observe-evolve cycles against benchmarks, and uses LLM-driven mutation to improve agent performance automatically.

Benchmark results (Claude Opus 4.6):

MCP-Atlas: 79.4% (#1)
SWE-bench Verified: 76.8% (~#5)
Terminal-Bench 2.0: 76.5% (~#7)
SkillsBench: 34.9% (#2)

When to Use A-Evolve

Use A-Evolve when:

Optimizing agent prompts, skills, or memory against a measurable benchmark
Building self-improving agents with automated gating and rollback
Evolving domain-specific tool usage and procedures through LLM-driven mutation
Running iterative solve-observe-evolve loops to maximize agent performance
Needing reproducible, git-versioned evolution history for every change

Key differentiator: Other frameworks _build_ agents; A-Evolve _optimizes_ them. It sits on top of any agent framework and makes it better through automated evolution.

Do NOT use A-Evolve for:

Building multi-agent orchestration from scratch (use CrewAI, LangGraph)
One-shot agent tasks with no iteration needed (use LangChain, LlamaIndex)
RAG pipeline optimization (use LlamaIndex, Chroma)
Prompt-only optimization without skill/memory evolution (use DSPy)

Quick Start

Installation

pip install a-evolve                    # Core
pip install "a-evolve[anthropic]"       # With Claude support (quoted so shells like zsh do not expand the brackets)
pip install "a-evolve[all]"             # All providers

Three-Line Evolution

import agent_evolve as ae

evolver = ae.Evolver(agent="swe", benchmark="swe-verified")
results = evolver.run(cycles=10)
print(f"Final score: {results.final_score}")

This copies the built-in SWE seed workspace to a working directory, runs up to 10 evolution cycles against SWE-bench Verified, and returns an EvolutionResult. The evolved agent state itself is left in the workspace files.

Core Concepts

The Agent Workspace

All evolvable state lives as files in a workspace directory:

my-agent/
├── manifest.yaml          # Metadata + entrypoint
├── prompts/
│   ├── system.md          # Main system prompt (evolved)
│   └── fragments/         # Modular prompt pieces
├── skills/
│   └── skill-name/
│       └── SKILL.md       # Reusable procedure with frontmatter
├── memory/
│   ├── episodic.jsonl     # Lessons from failures
│   └── semantic.jsonl     # General knowledge
├── tools/
│   ├── registry.yaml      # Tool manifest
│   └── tool_name.py       # Tool implementations
└── evolution/             # Managed by engine (metrics, history)

The Evolution Loop

Each cycle follows five phases:

1. Solve: Agent processes a batch of tasks from the benchmark
2. Observe: Benchmark evaluates trajectories, producing (task, trajectory, feedback) triples
3. Evolve: Evolution engine mutates workspace files based on observations
4. Gate: Validate mutations (git snapshot before/after for rollback)
5. Reload: Agent reinitializes from evolved filesystem state
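The five phases can be sketched with a toy model that uses only the standard library. Everything here is a hypothetical stand-in: `ToyWorkspace`, the difficulty-threshold "tasks", and the score-comparison gate are illustrative assumptions, not A-Evolve's actual classes or gating rule (the source describes gating only as validation with git snapshots for rollback).

```python
from dataclasses import dataclass


@dataclass
class ToyWorkspace:
    """Stand-in for the on-disk workspace: one evolvable 'prompt quality' value."""
    prompt_quality: float = 0.2


def solve(workspace: ToyWorkspace, tasks: list[float]) -> list[bool]:
    # Solve: a task succeeds when prompt quality meets its difficulty.
    return [workspace.prompt_quality >= difficulty for difficulty in tasks]


def observe(results: list[bool]) -> float:
    # Observe: reduce per-task outcomes to a score in [0, 1].
    return sum(results) / len(results)


def evolve(workspace: ToyWorkspace) -> ToyWorkspace:
    # Evolve: propose a mutated copy instead of editing in place.
    return ToyWorkspace(prompt_quality=min(1.0, workspace.prompt_quality + 0.2))


def run_cycles(tasks: list[float], cycles: int) -> list[float]:
    workspace = ToyWorkspace()
    history = []
    for _ in range(cycles):
        score = observe(solve(workspace, tasks))
        history.append(score)
        candidate = evolve(workspace)
        # Gate (assumed rule): keep the mutation only if it does not score worse.
        if observe(solve(candidate, tasks)) >= score:
            workspace = candidate  # Reload: the next cycle runs on evolved state
    return history


score_history = run_cycles(tasks=[0.1, 0.3, 0.5, 0.7, 0.9], cycles=4)
print(score_history)
```

Proposing a candidate and swapping it in only after the gate passes mirrors the snapshot-then-rollback idea: a rejected mutation never becomes the state the next cycle reloads.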

Three Pluggable Interfaces

# 1. Agent — implements solve()
class MyAgent(ae.BaseAgent):
    def solve(self, task: ae.Task) -> ae.Trajectory:
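        # Domain-specific solving logic goes here; the values below are placeholders
        result = "..."   # Final answer, patch, or action as a string
        steps = []       # Tool calls made while solving (list of dicts)
        return ae.Trajectory(task_id=task.id, output=result, steps=steps)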

# 2. Benchmark — implements get_tasks() and evaluate()
class MyBenchmark(ae.BenchmarkAdapter):
    def get_tasks(self, split="train", limit=None) -> list[ae.Task]:
        return [ae.Task(id="1", input="...")]

    def evaluate(self, task: ae.Task, trajectory: ae.Trajectory) -> ae.Feedback:
        return ae.Feedback(success=True, score=0.95, detail="Passed")

# 3. Engine — implements step()
class MyEngine(ae.EvolutionEngine):
    def step(self, workspace, observations, history, trial):
        # Mutate workspace based on observations
        return ae.StepResult(mutated=True, summary="Updated prompts")

Workflow 1: Evolve an Existing Agent

Use when: You have a working agent and want to optimize it against a benchmark.

Critical Requirements:

[ ] Agent implements BaseAgent.solve() returning Trajectory
[ ] Benchmark implements BenchmarkAdapter with get_tasks() and evaluate()
[ ] Seed workspace has manifest.yaml with entrypoint and evolvable layers
[ ] System prompt exists at prompts/system.md
[ ] Workspace is a git repo (run git init && git add -A && git commit -m "init")

Steps

import agent_evolve as ae

# Configure evolution parameters
config = ae.EvolveConfig(
    batch_size=10,           # Tasks per solve round
    max_cycles=20,           # Maximum evolution iterations
    evolve_prompts=True,     # Mutate system prompt
    evolve_skills=True,      # Discover and refine skills
    evolve_memory=True,      # Build episodic memory
    evolver_model="us.anthropic.claude-opus-4-6-v1",
)

# Point to your agent workspace and benchmark
evolver = ae.Evolver(
    agent="./my-agent-workspace",
    benchmark="swe-verified",     # Or custom BenchmarkAdapter instance
    config=config,
)

# Run evolution
results = evolver.run(cycles=10)

# Inspect results
print(f"Cycles completed: {results.cycles_completed}")
print(f"Final score: {results.final_score}")
print(f"Converged: {results.converged}")
for cycle_num, score in enumerate(results.score_history):
    print(f"  Cycle {cycle_num + 1}: {score:.3f}")

Post-Evolution

The workspace is now optimized. Inspect what changed:

cd my-agent-workspace
git log --oneline              # See evo-1, evo-2, ... tags
git diff evo-1 evo-10          # Compare first and last evolution
cat prompts/system.md          # Read evolved prompt
ls skills/                     # See discovered skills

Workflow 2: Add a Custom Benchmark

Use when: You want to evolve agents on your own domain-specific tasks.

Critical Requirements:

[ ] Define task format (inputs, expected outputs)
[ ] Implement scoring logic (0.0–1.0 scale)
[ ] Prepare task dataset (train + holdout split)

Steps

import agent_evolve as ae

class CodeReviewBenchmark(ae.BenchmarkAdapter):
    """Evaluate agents on code review quality."""

    def get_tasks(self, split="train", limit=None):
        tasks = load_review_dataset(split)   # User-supplied loader returning a list of dicts
        if limit is not None:
            tasks = tasks[:limit]
        return [
            ae.Task(id=t["id"], input=t["diff"], metadata={"expected": t["comments"]})
            for t in tasks
        ]

    def evaluate(self, task, trajectory):
        expected = task.metadata["expected"]
        actual = trajectory.output
        precision, recall = compute_review_metrics(expected, actual)
        f1 = 2 * precision * recall / (precision + recall + 1e-9)
        return ae.Feedback(
            success=f1 > 0.7,
            score=f1,
            detail=f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}",
        )

# Use with any agent
evolver = ae.Evolver(agent="./my-agent", benchmark=CodeReviewBenchmark())
results = evolver.run(cycles=5)
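The benchmark above calls `compute_review_metrics` without defining it. One possible implementation is sketched below. It assumes expected comments are a list of strings and that the agent's output is a newline-separated string of comments, and it uses exact matching after normalization. Real review scoring would likely need fuzzier matching, so treat this as an illustration rather than the skill's own helper.

```python
def compute_review_metrics(expected: list[str], actual: str) -> tuple[float, float]:
    """Return (precision, recall) using exact matches on normalized comments."""
    expected_set = {c.strip().lower() for c in expected if c.strip()}
    actual_set = {line.strip().lower() for line in actual.splitlines() if line.strip()}
    hits = len(expected_set & actual_set)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(actual_set) if actual_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return precision, recall


precision, recall = compute_review_metrics(
    expected=["Missing null check", "Unused import"],
    actual="unused import\nRename variable\nmissing null check \nAdd docstring",
)
# Same F1 formula as in evaluate() above.
f1 = 2 * precision * recall / (precision + recall + 1e-9)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here the agent found both expected comments but also raised two extra ones, so recall is perfect while precision is halved, and the resulting F1 falls below the 0.7 success threshold used in `evaluate()`.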

Workflow 3: Create a Custom Evolution Engine

Use when: The default LLM-driven mutation doesn't suit your domain.

Steps

import agent_evolve as ae

class RuleBasedEngine(ae.EvolutionEngine):
    def step(self, workspace, observations, history, trial):
        failures = [o for o in observations if not o.feedback.success]
        if not failures:
            return ae.StepResult(mutated=False, summary="No failures to address")

        # Analyze failure patterns
        error_types = categorize_errors(failures)   # User-supplied helper
        prompt = workspace.read_prompt()

        # Append learned rules to prompt
        new_rules = generate_rules(error_types)     # User-supplied helper returning list[str]
        workspace.write_prompt(prompt + "\n" + "\n".join(new_rules))

        return ae.StepResult(
            mutated=True,
            summary=f"Added {len(new_rules)} rules from {len(failures)} failures",
        )

evolver = ae.Evolver(
    agent="./my-agent",
    benchmark="swe-verified",     # Built-in name, or pass a BenchmarkAdapter instance
    engine=RuleBasedEngine(),
)

Built-in Components

Seed Agents

Agent	Domain	Model	Key Feature
`swe`	SWE-bench	Claude Opus 4.6	Verify-fix loop, skill proposals
`terminal`	Terminal-Bench	Claude Sonnet 4	Concurrent timeout, env discovery
`mcp`	MCP-Atlas	Claude Opus 4.6	MCP server integration

Benchmarks

Name	Domain	Metric
`swe-verified`	Code patching	Pass rate
`mcp-atlas`	Tool calling	Accuracy
`terminal2`	Shell tasks	Pass rate
`skill-bench`	Multi-step procedures	Accuracy
`arc-agi-3`	Interactive games	RHAE score

Evolution Algorithms

Algorithm	Strategy	Best For
A-Evolve/SkillForge	LLM-driven workspace mutation	General-purpose
Guided Synthesis	Memory-first, curated skills	Skill discovery
Adaptive Evolution	Reward tracking, filtered observations	Fine-grained control
Adaptive Skill	Skill-centric refinement	Skill-heavy domains

Configuration Reference

ae.EvolveConfig(
    batch_size=10,              # Tasks per solve round
    max_cycles=20,              # Max evolution iterations
    holdout_ratio=0.2,          # Test set split for gating
    evolve_prompts=True,        # Mutate system prompts
    evolve_skills=True,         # Discover/refine skills
    evolve_memory=True,         # Build episodic memory
    evolve_tools=False,         # Mutate tool implementations
    trajectory_only=False,      # Hide scores from evolver
    evolver_model="us.anthropic.claude-opus-4-6-v1",
    evolver_max_tokens=16384,
    egl_threshold=0.05,         # Convergence epsilon
    egl_window=3,               # Cycles for plateau detection
)

Convergence: Evolution stops early when the spread of scores (max minus min) over the last egl_window cycles is less than egl_threshold.

Skill Format

Skills are reusable procedures discovered and refined during evolution:

---
name: verify-edge-cases
description: "TRIGGER when: checking boundary conditions. DO NOT TRIGGER: for happy-path tests."
---

## Pattern
Test all falsy-but-valid values: 0, False, "", [], {}

## Process
1. List all input boundaries
2. Run each against the implementation
3. Check both output AND side effects

Skills accumulate in the workspace skills/ directory. The evolver curates them: ACCEPT new skills, MERGE overlapping ones, SKIP redundant proposals. Target: 5–10 broad skills, not 30 narrow ones.
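To make the format concrete, here is a minimal sketch of splitting a SKILL.md string into its frontmatter fields and body. It is a simplified line-based parser written for illustration (flat `key: value` pairs only), not a YAML parser and not the loader A-Evolve uses. The function name `split_skill` is hypothetical.

```python
def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter fields, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # No frontmatter present
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, text  # Unterminated frontmatter: treat everything as body
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


SKILL_TEXT = '''---
name: verify-edge-cases
description: "TRIGGER when: checking boundary conditions. DO NOT TRIGGER: for happy-path tests."
---

## Pattern
Test all falsy-but-valid values: 0, False, "", [], {}
'''

meta, body = split_skill(SKILL_TEXT)
print(meta["name"])
print(body.splitlines()[0])
```

For anything beyond flat string fields, a real YAML library would be the safer choice.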

Common Issues

Evolution score plateaus early

Cause: Batch size too small or evolver doesn't see enough failure diversity.
Fix: Increase batch_size (try 15–20) and ensure benchmark tasks cover diverse failure modes. Set trajectory_only=False so the evolver sees scores.

Agent workspace grows too large

Cause: Skill library bloat from accepting every proposal.
Fix: The default SkillForge engine curates skills automatically. If using a custom engine, implement merging logic to consolidate overlapping skills.

Git conflicts during evolution

Cause: Multiple evolution runs on the same workspace.
Fix: Each evolver.run() should operate on its own workspace copy. Use Evolver(agent="seed-name") to auto-copy the seed each time.

LLM provider errors during evolution

Cause: Rate limits or authentication issues with the evolver model.
Fix: Check evolver_model config. For Bedrock, ensure AWS credentials are configured. For Anthropic, set ANTHROPIC_API_KEY.

Custom agent not picking up evolved state

Cause: Agent doesn't implement reload_from_fs().
Fix: Override reload_from_fs() in your BaseAgent subclass to re-read prompts, skills, and memory from the workspace after each evolution cycle.
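The stale-state problem is easy to reproduce in isolation. The sketch below uses a toy `FileBackedAgent` class (a hypothetical stand-in that deliberately does not subclass `ae.BaseAgent`) to show why an agent that caches its prompt must re-read `prompts/system.md` after the file is mutated.

```python
import tempfile
from pathlib import Path


class FileBackedAgent:
    """Toy agent that caches its system prompt from a workspace directory."""

    def __init__(self, workspace_dir: Path):
        self.workspace_dir = Path(workspace_dir)
        self.system_prompt = ""
        self.reload_from_fs()

    def reload_from_fs(self) -> None:
        # Re-read evolvable state so the in-memory copy matches the file on disk.
        prompt_file = self.workspace_dir / "prompts" / "system.md"
        self.system_prompt = prompt_file.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "prompts" / "system.md"
    prompt_file.parent.mkdir(parents=True)
    prompt_file.write_text("v1 prompt", encoding="utf-8")

    agent = FileBackedAgent(Path(tmp))
    # Simulate an evolution step rewriting the prompt file.
    prompt_file.write_text("v2 prompt (evolved)", encoding="utf-8")

    stale = agent.system_prompt   # Cached copy, read before the mutation
    agent.reload_from_fs()
    fresh = agent.system_prompt   # Matches the evolved file

print(stale, "->", fresh)
```

Skills and memory follow the same pattern: anything loaded once in `__init__` needs a matching re-read in `reload_from_fs()`.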

Usage Instructions for Agents

When this skill is loaded:

1. Read this entire file before implementing any evolution workflow
2. Start with the Quick Start: get a minimal evolution running before customizing
3. Use built-in seeds when possible: "swe", "terminal", "mcp" have battle-tested configurations
4. Always initialize git in custom workspaces before running evolution
5. Check convergence settings: default egl_threshold=0.05 with egl_window=3 may be too aggressive for your domain
6. Inspect evolved state after each run: read prompts/system.md and skills/ to understand what the evolver learned

Pro Tips:

Set trajectory_only=False (default) so the evolver sees scores — this accelerates learning
Start with batch_size=10 and adjust based on task diversity
Use holdout_ratio=0.2 to prevent overfitting to training tasks
After evolution, git diff evo-1 evo-N shows the cumulative effect of all mutations
If the evolver isn't finding skills, enrich feedback.detail strings with specific failure reasons
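As a rough illustration of what a holdout ratio means in practice, the sketch below reserves a fixed fraction of task IDs for gating. The function `split_holdout`, the seeded shuffle, and the rounding rule are assumptions made for this example. The source does not describe how A-Evolve performs the split internally.

```python
import random


def split_holdout(task_ids: list[str], holdout_ratio: float = 0.2, seed: int = 0):
    """Shuffle deterministically, then reserve a fraction of tasks for gating."""
    shuffled = task_ids[:]
    random.Random(seed).shuffle(shuffled)   # Seeded so the split is reproducible
    n_holdout = int(len(shuffled) * holdout_ratio)
    return shuffled[n_holdout:], shuffled[:n_holdout]


all_ids = [f"task-{i}" for i in range(10)]
train, holdout = split_holdout(all_ids)
print(len(train), len(holdout))
```

Keeping the holdout tasks out of the batches the evolver observes is what lets a gate detect mutations that only fit the training tasks.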

Warning Signs:

Score oscillating between cycles → benchmark evaluation may be non-deterministic
Skills directory growing past 15+ skills → engine isn't merging/curating properly
Prompt growing past 10K chars → evolution is appending without refactoring
converged=True after 2-3 cycles → increase egl_window and decrease egl_threshold

References

Architecture deep dive: See references/architecture.md
API reference: See references/api.md
Step-by-step tutorials: See references/tutorials.md
Real-world examples: See references/examples.md
GitHub issues & solutions: See references/issues.md
Design patterns: See references/design-patterns.md
Release history: See references/releases.md

A-Evolve API Reference

Top-Level Module: `agent_evolve`

import agent_evolve as ae

`ae.Evolver`

Main entry point for running evolution.

class Evolver:
    def __init__(
        self,
        agent: str | BaseAgent,
        benchmark: str | BenchmarkAdapter,
        config: EvolveConfig | None = None,
        engine: EvolutionEngine | None = None,
        workspace_dir: str | None = None,
    ): ...

    def run(self, cycles: int | None = None) -> EvolutionResult: ...

Parameters:

agent: One of:
Built-in seed name: "swe", "terminal", "mcp"
Path to workspace directory: "./my-agent"
BaseAgent instance
benchmark: One of:
Built-in name: "swe-verified", "mcp-atlas", "terminal2", "skill-bench", "arc-agi-3"
BenchmarkAdapter instance
config: Evolution configuration. Defaults to EvolveConfig().
engine: Custom evolution engine. Defaults to AEvolveEngine.
workspace_dir: Override working directory for evolved state.

Resolution logic:

String agent names are matched against built-in seed workspaces, then treated as paths
Seed workspaces are copied to a working directory before evolution begins
Manifest validation ensures entrypoint and evolvable_layers are present
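The resolution rules can be read as a small decision procedure. The sketch below is an illustrative reimplementation, not the package's code: `SEEDS` is a hypothetical two-entry subset of the seed registry, and `resolve_agent` and `missing_manifest_keys` are names invented for this example. The manifest is taken as an already-parsed dict to keep the snippet free of YAML dependencies.

```python
from pathlib import Path

# Hypothetical subset of the built-in seed registry, for illustration only.
SEEDS = {"swe": "seed_workspaces/swe", "terminal": "seed_workspaces/terminal"}
REQUIRED_MANIFEST_KEYS = ("entrypoint", "evolvable_layers")


def resolve_agent(agent: str) -> tuple[str, Path]:
    """Return ("seed" or "path", location) for a string agent argument."""
    if agent in SEEDS:
        return "seed", Path(SEEDS[agent])   # Built-in names win over paths
    return "path", Path(agent)              # Otherwise treat the string as a path


def missing_manifest_keys(manifest: dict) -> list[str]:
    """List required keys that are absent from a parsed manifest."""
    return [key for key in REQUIRED_MANIFEST_KEYS if key not in manifest]


print(resolve_agent("swe"))
print(resolve_agent("./my-agent"))
print(missing_manifest_keys({"entrypoint": "agent.py"}))
```

One consequence of checking names first: a local directory literally named `swe` would be shadowed by the built-in seed, so a path such as `./swe` is the unambiguous way to refer to it.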

---

Core Types: `agent_evolve.types`

`Task`

@dataclass
class Task:
    id: str                    # Unique identifier
    input: str                 # Task description or input data
    metadata: dict = field(default_factory=dict)  # Extra context

`Trajectory`

@dataclass
class Trajectory:
    task_id: str               # Matches Task.id
    output: str                # Agent's final answer/patch/action
    steps: list[dict] = field(default_factory=list)  # Tool calls
    conversation: list[dict] = field(default_factory=list)  # Full messages

`Feedback`

@dataclass
class Feedback:
    success: bool              # Binary pass/fail
    score: float               # 0.0 to 1.0 continuous score
    detail: str = ""           # Human-readable explanation
    raw: dict = field(default_factory=dict)  # Benchmark-specific data

`Observation`

@dataclass
class Observation:
    task: Task
    trajectory: Trajectory
    feedback: Feedback

`SkillMeta`

@dataclass
class SkillMeta:
    name: str                  # Unique skill identifier
    description: str           # What it does and when to trigger
    path: str                  # Filesystem path to SKILL.md

`StepResult`

@dataclass
class StepResult:
    mutated: bool              # Whether workspace was changed
    summary: str               # Description of changes
    metadata: dict = field(default_factory=dict)

`CycleRecord`

@dataclass
class CycleRecord:
    cycle: int                       # Cycle number
    score: float                     # Average score this cycle
    mutated: bool                    # Whether workspace was changed
    engine_name: str = ""            # Name of the engine used
    summary: str = ""                # What the engine did
    observation_batch: str = ""      # Path to observation JSONL
    metadata: dict = field(default_factory=dict)

`EvolutionResult`

@dataclass
class EvolutionResult:
    cycles_completed: int
    final_score: float
    score_history: list[float] = field(default_factory=list)  # Score per cycle
    converged: bool = False
    details: dict = field(default_factory=dict)

---

Protocol: `agent_evolve.protocol.base_agent`

`BaseAgent`

class BaseAgent:
    def __init__(self, workspace_dir: str | Path): ...

    def solve(self, task: Task) -> Trajectory:
        """Override: solve a single task and return trajectory."""
        raise NotImplementedError

    def reload_from_fs(self):
        """Re-read prompts, skills, memory from workspace after evolution."""
        ...

    def export_to_fs(self):
        """Flush accumulated state (memories, skill proposals) to disk."""
        ...

    def remember(self, content: str, category: str = "episodic", **extra):
        """Buffer an episodic memory entry."""
        ...

    def get_skill_content(self, name: str) -> str:
        """Read a skill document by name."""
        ...

    @property
    def system_prompt(self) -> str:
        """Current system prompt loaded from workspace."""
        ...

    @property
    def skills(self) -> list[SkillMeta]:
        """List of available skills."""
        ...

---

Benchmarks: `agent_evolve.benchmarks.base`

`BenchmarkAdapter`

class BenchmarkAdapter:
    def get_tasks(self, split: str = "train", limit: int = 10) -> list[Task]:
        """Return tasks from the benchmark dataset."""
        raise NotImplementedError

    def evaluate(self, task: Task, trajectory: Trajectory) -> Feedback:
        """Evaluate an agent's trajectory on a task."""
        raise NotImplementedError

---

Engine: `agent_evolve.engine.base`

`EvolutionEngine`

class EvolutionEngine:
    def step(
        self,
        workspace: AgentWorkspace,
        observations: list[Observation],
        history: EvolutionHistory,
        trial: TrialRunner | None = None,
    ) -> StepResult:
        """Mutate workspace based on observations. Return what changed."""
        raise NotImplementedError

    def on_cycle_end(self, accepted: bool, score: float):
        """Optional: called after gating decision (accept/reject mutations)."""
        pass

---

Configuration: `agent_evolve.config`

`EvolveConfig`

@dataclass
class EvolveConfig:
    # Batch and cycle control
    batch_size: int = 10
    max_cycles: int = 20
    holdout_ratio: float = 0.2

    # Evolvable layers
    evolve_prompts: bool = True
    evolve_skills: bool = True
    evolve_memory: bool = True
    evolve_tools: bool = False

    # Observation transparency
    trajectory_only: bool = False    # If True, hide score/feedback from evolver

    # Evolver LLM
    evolver_model: str = "us.anthropic.claude-opus-4-6-v1"
    evolver_max_tokens: int = 16384

    # Convergence
    egl_threshold: float = 0.05
    egl_window: int = 3

    # Extension point
    extra: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_yaml(cls, path: str) -> "EvolveConfig": ...

YAML format:

batch_size: 15
max_cycles: 30
evolve_prompts: true
evolve_skills: true
evolve_memory: false
evolver_model: us.anthropic.claude-opus-4-6-v1
egl_threshold: 0.03
egl_window: 5
extra:
  solver_proposed: true

---

Workspace: `agent_evolve.contract.workspace`

`AgentWorkspace`

class AgentWorkspace:
    def __init__(self, path: str): ...

    # Prompts
    def read_prompt(self) -> str: ...                         # Reads prompts/system.md
    def write_prompt(self, content: str) -> None: ...         # Writes prompts/system.md
    def read_fragment(self, name: str) -> str: ...            # Reads prompts/fragments/{name}
    def write_fragment(self, name: str, content: str) -> None: ...
    def list_fragments(self) -> list[str]: ...

    # Skills
    def list_skills(self) -> list[SkillMeta]: ...
    def read_skill(self, name: str) -> str: ...
    def write_skill(self, name: str, content: str) -> None: ...
    def delete_skill(self, name: str) -> None: ...

    # Drafts (proposed skills pending review)
    def list_drafts(self) -> list[dict[str, str]]: ...
    def write_draft(self, name: str, content: str) -> None: ...
    def clear_drafts(self) -> None: ...

    # Memory
    def add_memory(self, entry: dict, category: str = "episodic") -> None: ...
    def read_memories(self, category: str = "episodic", limit: int = 100) -> list[dict]: ...
    def read_all_memories(self, limit: int = 100) -> list[dict]: ...

    # Tools
    def read_tool_registry(self) -> list[dict]: ...
    def write_tool_registry(self, tools: list[dict]) -> None: ...
    def read_tool(self, name: str) -> str: ...
    def write_tool(self, name: str, content: str) -> None: ...

    # Evolution metadata
    def read_evolution_history(self) -> list[dict]: ...
    def read_evolution_metrics(self) -> dict: ...

    # Manifest
    def read_manifest(self) -> dict: ...

---

Built-in Algorithms

`agent_evolve.algorithms.skillforge.engine.AEvolveEngine`

Default LLM-driven evolution. Uses Claude with bash tool access to analyze observations and directly edit workspace files.

`agent_evolve.algorithms.guided_synth.GuidedSynthesisEngine`

Memory-first evolution: extracts minimal episodic memory from failures, then curates skill proposals.

`agent_evolve.algorithms.adaptive.AdaptiveEvolutionEngine`

Observation filtering + reward tracking + adaptive intervention density.

`agent_evolve.algorithms.adaptive_skill.AdaptiveSkillEngine`

Skill-centric: focuses exclusively on skill discovery and refinement.

---

Built-in Registries

Agent and benchmark resolution uses registries in api.py:

AGENT_REGISTRY = {
    "swe": "seed_workspaces/swe",
    "swe-verified": "seed_workspaces/swe",
    "terminal": "seed_workspaces/terminal",
    "terminal2": "seed_workspaces/terminal",
    "mcp": "seed_workspaces/mcp",
    "mcp-atlas": "seed_workspaces/mcp",
    "arc": "seed_workspaces/arc",
    ...
}

BENCHMARK_REGISTRY = {
    "swe-verified": "agent_evolve.benchmarks.swe_verified.SweVerifiedBenchmark",
    "mcp-atlas": "agent_evolve.benchmarks.mcp_atlas.McpAtlasBenchmark",
    "terminal2": "agent_evolve.benchmarks.terminal2.Terminal2Benchmark",
    "skill-bench": "agent_evolve.benchmarks.skill_bench.SkillBenchBenchmark",
    "arc-agi-3": "agent_evolve.benchmarks.arc_agi3.ArcAgi3Benchmark",
    ...
}
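Registry values like those above are dotted import paths. A common way to turn such a string into a class is `importlib.import_module` plus `getattr`. The sketch below shows that general technique, and whether A-Evolve resolves its registry exactly this way is an assumption. The helper name `load_from_dotted_path` is hypothetical.

```python
import importlib


def load_from_dotted_path(dotted_path: str):
    """Import 'package.module.ClassName' and return the named attribute."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.Class', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Demonstrated with a standard-library class so the snippet runs anywhere.
ordered_dict_cls = load_from_dotted_path("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

Storing strings instead of classes defers the import until a benchmark is actually requested, which avoids loading every benchmark's dependencies up front.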

---

Evolution Loop: `agent_evolve.engine.loop`

`EvolutionLoop`

class EvolutionLoop:
    def __init__(
        self,
        agent: BaseAgent,
        benchmark: BenchmarkAdapter,
        engine: EvolutionEngine,
        config: EvolveConfig,
        workspace: AgentWorkspace,
    ): ...

    def run(self, cycles: int | None = None) -> EvolutionResult:
        """Run the full evolution loop for the specified number of cycles.

        Each cycle:
        1. SOLVE - Agent solves a batch of tasks
        2. OBSERVE - Benchmark evaluates, creates Observation triples
        3. PRE-SNAPSHOT - Git commit with pre-evo-N tag
        4. ENGINE.STEP - Engine mutates workspace
        5. POST-SNAPSHOT - Git commit with evo-N tag
        6. RECORD - Log CycleRecord
        7. RELOAD - agent.reload_from_fs()
        8. CONVERGE - Check score plateau
        """
        ...

Convergence Function

def _is_score_converged(
    scores: list[float],
    window: int = 3,
    epsilon: float = 0.01,
) -> bool:
    """Check if scores have plateaued.

    Returns True if the difference between max and min scores
    in the last `window` entries is less than `epsilon`.

    Note: The `epsilon` parameter defaults to 0.01 in the function
    signature. The `EvolveConfig.egl_threshold` (default 0.05) is
    passed as the `epsilon` argument when called from the loop.
    """
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return (max(recent) - min(recent)) < epsilon
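A short worked example shows how the defaults behave. The function below repeats the plateau check from above under a public name (`is_score_converged`, with epsilon set to the config default of 0.05) so the snippet runs on its own. The score lists are made-up values chosen for illustration.

```python
def is_score_converged(scores: list[float], window: int = 3, epsilon: float = 0.05) -> bool:
    """Same max-minus-min plateau check as above, with epsilon=0.05."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return (max(recent) - min(recent)) < epsilon


slow_gains = [0.40, 0.42, 0.44]     # Improving, but spread is about 0.04
steady_climb = [0.40, 0.46, 0.52]   # Spread is 0.12

stopped_early = is_score_converged(slow_gains)                        # Defaults
stricter = is_score_converged(slow_gains, window=5, epsilon=0.03)     # Tighter settings
still_running = is_score_converged(steady_climb)

print(stopped_early, stricter, still_running)
```

With the defaults, three cycles of small but real gains already count as a plateau, which is the situation the "converged=True after 2-3 cycles" warning sign describes. A larger window or smaller threshold keeps the run going.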

---

Observer: `agent_evolve.engine.observer`

`Observer`

Collects and persists observations during evolution.

class Observer:
    def __init__(self, workspace_path: str | Path): ...

    def record(self, task: Task, trajectory: Trajectory, feedback: Feedback):
        """Buffer a single observation."""
        ...

    def flush(self, batch_label: str = ""):
        """Write buffered observations to JSONL file.

        Files are written to: evolution/observations/batch_{label}.jsonl
        """
        ...

    def get_observations(self) -> list[Observation]:
        """Return buffered observations (not yet flushed)."""
        ...

`EvolutionHistory`

Query facade over past evolution cycles.

class EvolutionHistory:
    def __init__(self, workspace_path: str | Path): ...

    def get_observations(
        self,
        last_n_cycles: int | None = None,
        only_failures: bool = False,
    ) -> list[Observation]:
        """Read observations from stored JSONL files."""
        ...

    def get_score_curve(self) -> list[tuple[int, float]]:
        """Return (cycle_number, score) pairs for all completed cycles."""
        ...

    def get_workspace_diff(self, from_label: str, to_label: str) -> str:
        """Get git diff between two version labels (e.g., 'evo-1', 'evo-5')."""
        ...

    def read_file_at(self, version_label: str, path: str) -> str:
        """Read a workspace file as it existed at a given version."""
        ...

---

Version Control: `agent_evolve.engine.versioning`

`VersionControl`

class VersionControl:
    def __init__(self, workspace_path: str | Path): ...

    def init(self): ...
    def commit(self, message: str, tag: str | None = None): ...
    def get_diff(self, from_ref: str, to_ref: str) -> str: ...
    def show_file_at(self, ref: str, path: str) -> str: ...
    def list_tags(self, prefix: str = "evo-") -> list[str]: ...
    def get_log(self, max_entries: int = 50) -> list[dict]: ...

---

Skill Format Specification

Skills are stored as skills/{name}/SKILL.md with YAML frontmatter:

---
name: skill-name                    # kebab-case identifier
description: "TRIGGER when: condition. DO NOT TRIGGER: exclusion."
---

Skill Lifecycle

1. Proposal: Agent writes to skills/_drafts/ during solve()
2. Review: Evolution engine reads drafts during step()
3. Accept: Engine moves draft to skills/{name}/SKILL.md
4. Merge: Engine combines similar skills to prevent bloat
5. Refine: Engine updates skill content based on new observations

Skill Loading

# In agent's solve() method
for skill_meta in self.skills:
    content = self.get_skill_content(skill_meta.name)
    # Returns SKILL.md content (frontmatter stripped)

Skill Injection Patterns

Append to system prompt:

skill_text = "\n".join(
    f"## {s.name}\n{self.get_skill_content(s.name)}"
    for s in self.skills
)
system = f"{self.system_prompt}\n\n# Skills\n{skill_text}"

Selective injection based on task:

relevant_skills = [
    s for s in self.skills
    if task_matches_skill(task, s.description)
]
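`task_matches_skill` is left undefined in the snippet above. One crude way to fill it in is keyword overlap between the task text and the skill's TRIGGER clause, sketched below. The matching rule, the `min_overlap` parameter, and the choice to pass the task's input string rather than the task object are all assumptions made for this illustration.

```python
import re


def task_matches_skill(task_input: str, description: str, min_overlap: int = 2) -> bool:
    """Crude relevance check: count shared lowercase words of four or more letters."""

    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z]{4,}", text.lower()))

    # Compare against the TRIGGER clause only, ignoring the exclusion clause.
    trigger_clause = description.split("DO NOT TRIGGER")[0]
    return len(tokenize(task_input) & tokenize(trigger_clause)) >= min_overlap


DESCRIPTION = "TRIGGER when: checking boundary conditions. DO NOT TRIGGER: for happy-path tests."

relevant = task_matches_skill("Add tests checking boundary conditions for parse_int", DESCRIPTION)
irrelevant = task_matches_skill("Write happy-path tests for the login form", DESCRIPTION)
print(relevant, irrelevant)
```

Inside an agent this would be called as `task_matches_skill(task.input, s.description)`. Embedding similarity or an LLM-based router would be more robust than word overlap, at the cost of extra calls per task.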

---

Memory System

Memory Categories

Category	File	Purpose
`episodic`	`memory/episodic.jsonl`	Lessons from specific task attempts
`semantic`	`memory/semantic.jsonl`	General domain knowledge
Custom	`memory/{category}.jsonl`	User-defined categories

Memory in the Agent

# Writing memory during solve()
self.remember(
    "File locks on NFS require fcntl.flock with LOCK_EX",
    category="domain_knowledge",
)

# Reading memory (loaded automatically by reload_from_fs)
for mem in self.memories:
    print(f"[{mem.get('category')}] {mem.get('content')}")

Memory in the Workspace

workspace = AgentWorkspace("./my-agent")

# Add a memory entry
workspace.add_memory(
    {"content": "Always run full test suite", "source": "cycle-5-failure"},
    category="episodic",
)

# Read memories
recent = workspace.read_memories(category="episodic", limit=20)
all_mems = workspace.read_all_memories(limit=100)
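The JSONL layout behind these calls is simple enough to sketch with the standard library. The functions below are illustrative stand-ins with the same names as the workspace methods, written as free functions over a directory path. That `limit` keeps the most recent entries is an assumption, since the source does not say which entries are returned.

```python
import json
import tempfile
from pathlib import Path


def add_memory(workspace: Path, entry: dict, category: str = "episodic") -> None:
    """Append one JSON object per line to memory/{category}.jsonl."""
    memory_file = workspace / "memory" / f"{category}.jsonl"
    memory_file.parent.mkdir(parents=True, exist_ok=True)
    with memory_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def read_memories(workspace: Path, category: str = "episodic", limit: int = 100) -> list[dict]:
    """Return up to `limit` of the most recent entries, oldest first."""
    memory_file = workspace / "memory" / f"{category}.jsonl"
    if limit <= 0 or not memory_file.exists():
        return []
    lines = memory_file.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()][-limit:]


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    for text in ("first", "second", "third"):
        add_memory(ws, {"content": text})
    recent = read_memories(ws, limit=2)
    empty = read_memories(ws, category="semantic")   # No file written yet

print([m["content"] for m in recent])
```

Append-only JSONL keeps each memory write a single-line diff, which fits the git-versioned workspace model described elsewhere in this document.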

Memory Evolution

When evolve_memory=True, the evolution engine can:

Add new episodic entries summarizing failure patterns
Consolidate redundant memories
Promote episodic memories to semantic (general knowledge)
Remove stale or misleading memories

A-Evolve Architecture Deep Dive

Design Philosophy

A-Evolve treats agent optimization as a file-system mutation problem. All evolvable state — prompts, skills, memory, tools — lives as plain files in a workspace directory. Evolution engines read observations, mutate files, and git-commit snapshots. This makes every change human-readable, diffable, and rollbackable.

There are no learned weights, no gradient updates, no opaque parameters. Every mutation is an explicit edit to a text file.

System Architecture

┌─────────────────────────────────────────────────────┐
│                    Evolver API                       │
│  evolver = ae.Evolver(agent, benchmark, config)     │
│  results = evolver.run(cycles=N)                    │
└──────────────────────┬──────────────────────────────┘
                       │
              ┌────────▼────────┐
              │  EvolutionLoop  │
              └────────┬────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
   ┌────▼────┐  ┌──────▼──────┐  ┌───▼────┐
   │  Agent  │  │  Benchmark  │  │ Engine │
   │ solve() │  │ evaluate()  │  │ step() │
   └────┬────┘  └──────┬──────┘  └───┬────┘
        │              │              │
        └──────────────┼──────────────┘
                       │
              ┌────────▼────────┐
              │ Agent Workspace │
              │  (filesystem)   │
              └─────────────────┘

The Three Interfaces

1. BaseAgent

The BaseAgent class is the parent of all evolvable agents. It provides:

File system contract: Loads system prompts, skills, memories from workspace paths
Memory management: remember() buffers episodic entries during solve
Skill access: get_skill_content() retrieves skill documents dynamically
Hot reload: reload_from_fs() re-reads all state after evolution mutates files
Export: export_to_fs() flushes accumulated state (memories, skill proposals)

Subclasses override solve(task: Task) -> Trajectory with domain logic.

class BaseAgent:
    def __init__(self, workspace_path: str): ...
    def solve(self, task: Task) -> Trajectory: ...       # Override this
    def reload_from_fs(self): ...                         # Re-read after evolution
    def export_to_fs(self): ...                           # Flush state to disk
    def remember(self, content, category="episodic"): ... # Buffer episodic memory
    def get_skill_content(self, name: str) -> str: ...    # Read a skill

2. BenchmarkAdapter

Benchmarks provide tasks and evaluation:

class BenchmarkAdapter:
    def get_tasks(self, split="train", limit=10) -> list[Task]: ...
    def evaluate(self, task: Task, trajectory: Trajectory) -> Feedback: ...

Built-in benchmarks use entry points registered in api.py:

Registry Key	Class	Module
`swe-verified`	`SweVerifiedBenchmark`	`agent_evolve.benchmarks.swe_verified`
`mcp-atlas`	`McpAtlasBenchmark`	`agent_evolve.benchmarks.mcp_atlas`
`terminal2`	`Terminal2Benchmark`	`agent_evolve.benchmarks.terminal2`
`skill-bench`	`SkillBenchBenchmark`	`agent_evolve.benchmarks.skill_bench`
`arc-agi-3`	`ArcAgi3Benchmark`	`agent_evolve.benchmarks.arc_agi3`

3. EvolutionEngine

Engines decide how to mutate the workspace:

class EvolutionEngine:
    def step(self, workspace, observations, history, trial) -> StepResult: ...
    def on_cycle_end(self, accepted: bool): ...  # Optional callback

Arguments received:

workspace: AgentWorkspace — typed read/write access to all agent files
observations: List of Observation — recent (task, trajectory, feedback) triples
history: EvolutionHistory — query facade over past cycles and workspace versions
trial: Optional trial runner for expensive live validation
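
As an illustration of the `step()` contract, the sketch below shows a tiny rule-based engine that appends one bullet per new failure to the system prompt. `Observation`, `StepResult`, and `InMemoryWorkspace` are simplified stand-ins written for this example so it runs without the library; their fields are assumptions and may differ from the real classes.

```python
from dataclasses import dataclass


@dataclass
class Observation:  # stand-in for a (task, trajectory, feedback) triple
    task_id: str
    success: bool
    feedback_detail: str


@dataclass
class StepResult:  # stand-in; the real StepResult fields are not shown in this document
    mutated: bool
    summary: str = ""


class InMemoryWorkspace:  # stand-in exposing only read_prompt / write_prompt
    def __init__(self, prompt: str):
        self._prompt = prompt

    def read_prompt(self) -> str:
        return self._prompt

    def write_prompt(self, text: str) -> None:
        self._prompt = text


class FailureNoteEngine:
    """Hypothetical engine: records each distinct failure detail in the prompt once."""

    def step(self, workspace, observations, history=None, trial=None) -> StepResult:
        prompt = workspace.read_prompt()
        new_notes = []
        for obs in observations:
            note = f"- {obs.feedback_detail}"
            if obs.success or not obs.feedback_detail:
                continue
            if note not in prompt and note not in new_notes:
                new_notes.append(note)
        if not new_notes:
            return StepResult(mutated=False, summary="no new failures")
        if "## Known Failure Modes" not in prompt:
            prompt += "\n\n## Known Failure Modes"
        workspace.write_prompt(prompt + "\n" + "\n".join(new_notes))
        return StepResult(mutated=True, summary=f"added {len(new_notes)} note(s)")

    def on_cycle_end(self, accepted: bool) -> None:
        pass  # nothing to clean up in this sketch
```

A rule-based engine like this is deterministic and cheap, which makes it a useful baseline before switching to an LLM-driven engine.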

Agent Workspace Contract

The AgentWorkspace class provides typed access to workspace files:

workspace = AgentWorkspace("./my-agent")

# Prompts (reads/writes prompts/system.md)
prompt = workspace.read_prompt()
workspace.write_prompt(new_prompt)

# Prompt fragments (modular pieces in prompts/fragments/)
fragment = workspace.read_fragment("reasoning.md")
workspace.write_fragment("reasoning.md", content)

# Skills
skills = workspace.list_skills()          # Returns list of SkillMeta
content = workspace.read_skill("verify")  # Returns skill content
workspace.write_skill("verify", content)  # Write/update skill
workspace.delete_skill("obsolete")        # Remove a skill

# Memory
entries = workspace.read_memories("episodic")          # Read by category
workspace.add_memory({"content": "..."}, "episodic")   # Append entry (agents read the "content" key)
all_entries = workspace.read_all_memories(limit=100)   # All categories

# Tools
registry = workspace.read_tool_registry()
workspace.write_tool("my_tool.py", code)

Manifest Format

Every workspace has a manifest.yaml:

agent:
  type: reference
  entrypoint: agent_evolve.agents.swe.agent.SweAgent

evolvable_layers:
  - prompts
  - skills
  - memory

reload_strategy: hot    # or "cold"

entrypoint: Dotted Python path to the agent class
evolvable_layers: Which directories the engine is allowed to mutate
reload_strategy: Whether agent re-reads state mid-cycle (hot) or restarts (cold)
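
A quick pre-flight check on the parsed manifest can catch a broken workspace before evolution starts. The sketch below validates a manifest that has already been loaded into a dict (for example with a YAML parser). The function name, the list of known layers, and the `hot` default for `reload_strategy` are assumptions made for illustration.

```python
# Layers named in this document; treating any other value as an error is an assumption.
KNOWN_LAYERS = ("prompts", "skills", "memory", "tools")
RELOAD_STRATEGIES = ("hot", "cold")


def validate_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems; an empty list means the manifest looks usable."""
    errors = []

    entrypoint = (manifest.get("agent") or {}).get("entrypoint", "")
    parts = entrypoint.split(".") if isinstance(entrypoint, str) else []
    if len(parts) < 2 or not all(part.isidentifier() for part in parts):
        errors.append("agent.entrypoint must be a dotted Python path like 'pkg.module.ClassName'")

    layers = manifest.get("evolvable_layers")
    if not isinstance(layers, list) or not layers:
        errors.append("evolvable_layers must be a non-empty list")
    else:
        unknown = [layer for layer in layers if layer not in KNOWN_LAYERS]
        if unknown:
            errors.append(f"unknown evolvable_layers: {unknown}")

    strategy = manifest.get("reload_strategy", "hot")  # default is assumed
    if strategy not in RELOAD_STRATEGIES:
        errors.append(f"reload_strategy must be one of {list(RELOAD_STRATEGIES)}")

    return errors
```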

Evolution Loop Internals

The EvolutionLoop orchestrates each cycle:

For each cycle 1..N:
  1. SOLVE:     agent.solve(task) for each task in batch
  2. OBSERVE:   benchmark.evaluate(task, trajectory) -> Feedback
  3. SNAPSHOT:  git commit as "pre-evo-{N}"
  4. EVOLVE:    engine.step(workspace, observations, history, trial)
  5. SNAPSHOT:  git commit as "evo-{N}"
  6. RECORD:    Log cycle number, score, engine metadata
  7. RELOAD:    agent.reload_from_fs()
  8. CONVERGE:  If the score plateaus for egl_window cycles -> exit
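
The eight steps can be sketched as a plain Python loop. This is a simplified illustration, not the library's `EvolutionLoop`: snapshots are recorded as labels instead of git commits, feedback is a bare float score, and the toy agent, benchmark, and engine exist only so the example runs on its own.

```python
from statistics import mean


def run_loop(agent, benchmark, engine, workspace, max_cycles,
             egl_window=3, egl_threshold=0.05):
    scores, snapshots = [], []
    for n in range(1, max_cycles + 1):
        observations = []
        for task in benchmark.get_tasks():
            trajectory = agent.solve(task)                     # 1. SOLVE
            feedback = benchmark.evaluate(task, trajectory)    # 2. OBSERVE
            observations.append((task, trajectory, feedback))
        snapshots.append(f"pre-evo-{n}")                       # 3. SNAPSHOT (label only)
        engine.step(workspace, observations, scores, None)     # 4. EVOLVE
        snapshots.append(f"evo-{n}")                           # 5. SNAPSHOT (label only)
        scores.append(mean(fb for _, _, fb in observations))   # 6. RECORD
        agent.reload_from_fs()                                 # 7. RELOAD
        window = scores[-egl_window:]                          # 8. CONVERGE
        if len(window) == egl_window and max(window) - min(window) < egl_threshold:
            return {"converged": True, "scores": scores, "snapshots": snapshots}
    return {"converged": False, "scores": scores, "snapshots": snapshots}


class ToyAgent:
    """Toy agent whose ability is the integer 'level' stored in the workspace."""

    def __init__(self, workspace):
        self.workspace = workspace
        self.reload_from_fs()

    def solve(self, task):
        return task * self.level

    def reload_from_fs(self):
        self.level = self.workspace["level"]


class ToyBenchmark:
    def get_tasks(self):
        return [1, 2]

    def evaluate(self, task, trajectory):
        return min(trajectory / (task * 3), 1.0)  # perfect score at level 3


class ToyEngine:
    def step(self, workspace, observations, history, trial):
        workspace["level"] = min(workspace["level"] + 1, 3)
```

Usage: `run_loop(ToyAgent(ws), ToyBenchmark(), ToyEngine(), ws, max_cycles=10)` with `ws = {"level": 1}` stops early once three consecutive cycles score the same.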

Convergence Detection

The loop tracks scores over a sliding window:

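# Converged if scores varied by less than egl_threshold over the last egl_window cycles
if len(history) >= egl_window:  # Need a full window before declaring a plateau
    scores = [cycle.score for cycle in history[-egl_window:]]
    if max(scores) - min(scores) < egl_threshold:
        return EvolutionResult(converged=True, ...)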

Default: egl_threshold=0.05, egl_window=3.
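
A worked example helps when choosing these two values. The helper below replays a score curve and reports the first cycle at which the sliding-window rule would trigger. It is a stand-alone sketch: `first_converged_cycle` is a hypothetical name, and it assumes the rule needs a full window of scores before it can fire.

```python
def has_converged(scores, egl_window=3, egl_threshold=0.05):
    """True when the last egl_window scores span less than egl_threshold."""
    if len(scores) < egl_window:
        return False  # not enough cycles to judge a plateau
    window = scores[-egl_window:]
    return max(window) - min(window) < egl_threshold


def first_converged_cycle(scores, egl_window=3, egl_threshold=0.05):
    """Return the 1-based cycle at which the rule first fires, or None."""
    for cycle in range(1, len(scores) + 1):
        if has_converged(scores[:cycle], egl_window, egl_threshold):
            return cycle
    return None


curve = [0.50, 0.60, 0.70, 0.72, 0.73]
print(first_converged_cycle(curve))  # the last three scores span 0.03, below 0.05
```

Note that the rule looks at the spread of scores, so a regression inside the window delays convergence just as an improvement does.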

Observation Format

Observations are stored as JSONL in evolution/observations/:

{
  "task_id": "django__django-16379",
  "task_input": "Fix FileBasedCache has_key ...",
  "agent_output": "--- a/django/core/cache/backends/filebased.py\n+++ ...",
  "steps": [
    {"tool": "bash", "action": "read_file", "file": "src/main.py"},
    {"tool": "bash", "action": "edit_file", "file": "src/main.py"}
  ],
  "success": true,
  "score": 0.95,
  "feedback_detail": "All tests passed"
}
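
For ad-hoc analysis outside the library, observation batches can be read with the standard library alone. The sketch below assumes each record occupies a single line (the JSONL convention; the record above is pretty-printed for readability) and uses only field names shown in that record. `load_observations` and `summarize` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_observations(obs_dir: Path) -> list:
    """Read every *.jsonl batch in the directory, in filename order."""
    records = []
    for batch in sorted(obs_dir.glob("*.jsonl")):
        with batch.open(encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records


def summarize(records: list) -> dict:
    """Count records, list failed task ids, and average the scores."""
    failures = [r for r in records if not r.get("success")]
    mean_score = sum(r.get("score", 0.0) for r in records) / len(records) if records else 0.0
    return {
        "total": len(records),
        "failed_task_ids": [r["task_id"] for r in failures],
        "mean_score": mean_score,
    }
```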

Version Control Integration

Every evolution cycle creates git snapshots:

pre-evo-N: State before engine mutates the workspace
evo-N: State after engine mutates the workspace

This enables:

Rollback: git checkout evo-3 to revert to cycle 3
Diff analysis: git diff evo-1 evo-10 to see cumulative evolution
History queries: history.get_workspace_diff("evo-3", "evo-7")
File time travel: history.read_file_at("evo-5", "prompts/system.md")

Default Engine: A-Evolve/SkillForge

The default AEvolveEngine uses an LLM with bash tool access to mutate workspaces:

1. Analyze observations: Read recent task results, failures, and trajectories
2. Build context: Construct multi-part prompt with observations, existing skills, and draft proposals
3. LLM mutation: Claude with bash tools directly edits workspace files
4. Track changes: Compare skill counts and file diffs before/after

The engine effectively turns the LLM into a "developer" who reads test results and improves the agent's code/prompts accordingly. This is powerful because the evolver can make nuanced, context-aware changes that rule-based systems cannot.

Observer and History

The Observer collects observations as JSONL batches:

observer = Observer(workspace_path)
observer.record(task, trajectory, feedback)
observer.flush()  # Writes to evolution/observations/batch_XXXX.jsonl

The EvolutionHistory provides query access:

history = EvolutionHistory(workspace_path)
history.get_observations(last_n_cycles=3)
history.get_observations(only_failures=True)
history.get_score_curve()                        # List of (cycle, score)
history.get_workspace_diff("evo-1", "evo-5")     # Git diff
history.read_file_at("evo-3", "prompts/system.md")

Multi-Provider LLM Support

A-Evolve supports multiple LLM providers for both the solving agent and the evolution engine:

Provider	Config Key	Auth
Anthropic	`anthropic`	`ANTHROPIC_API_KEY` env var
OpenAI	`openai`	`OPENAI_API_KEY` env var
AWS Bedrock	`bedrock`	AWS credentials (boto3)
LiteLLM	`litellm`	Provider-specific keys

The evolver model is configured separately from the agent's model:

config = ae.EvolveConfig(
    evolver_model="us.anthropic.claude-opus-4-6-v1",  # Evolution engine model
    evolver_max_tokens=16384,
)

Agent models are configured within the seed workspace (e.g., in manifest.yaml or the agent code).

Evolution Algorithm Details

A-Evolve/SkillForge (Default)

The default engine treats evolution as a code editing problem. It gives an LLM access to bash tools and the workspace filesystem, then asks it to improve the agent based on observations.

How it works:

1. Context assembly: Builds a prompt containing:

Recent observations (task inputs, agent outputs, feedback scores and details)
Current system prompt content
Current skill library with full SKILL.md content
Pending draft proposals from the agent
Score history across cycles

2. LLM interaction: Calls the evolver model (default: Claude Opus 4.6) with bash tool access. The LLM can:

Read and edit prompts/system.md
Create, modify, or delete skills in skills/
Write episodic memory entries
Review and accept/reject draft skill proposals

3. Mutation tracking: After the LLM finishes, the engine:

Counts skill additions, modifications, and deletions
Measures prompt length change
Records a summary of what was changed and why

4. Git snapshot: All changes are committed as evo-N
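
The mutation-tracking step (step 3) can be approximated by comparing skill contents before and after the LLM edits. The sketch below is one way to do that with content hashes; it is an assumption about the approach, not the engine's actual code, and the function names are hypothetical.

```python
import hashlib


def snapshot_skills(skills: dict) -> dict:
    """Map each skill name to a hash of its content."""
    return {name: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for name, text in skills.items()}


def diff_skills(before: dict, after: dict) -> dict:
    """Classify skills as added, modified, or deleted between two snapshots."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "modified": sorted(n for n in before.keys() & after.keys() if before[n] != after[n]),
        "deleted": sorted(before.keys() - after.keys()),
    }


def track_mutation(skills_before: dict, skills_after: dict,
                   prompt_before: str, prompt_after: str) -> dict:
    """Summarize one evolution step: skill changes plus prompt length change."""
    changes = diff_skills(snapshot_skills(skills_before), snapshot_skills(skills_after))
    changes["prompt_length_delta"] = len(prompt_after) - len(prompt_before)
    return changes
```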

Strengths:

Can make nuanced, context-aware changes
Understands relationships between prompt sections and skill content
Can refactor and consolidate (not just append)

Weaknesses:

Expensive per cycle (full LLM call with large context)
Quality depends on evolver model capability
Non-deterministic (same observations may produce different mutations)

Guided Synthesis

A memory-first approach that emphasizes learning from failures before creating skills.

How it works:

1. Failure extraction: Identifies failed tasks and extracts minimal lessons
2. Memory population: Writes episodic memory entries for each failure pattern
3. Skill proposal: After accumulating enough memories, synthesizes skill proposals
4. Curation: Reviews proposals against existing skills, accepts, merges, or skips

Best for:

Domains where the agent's base reasoning is sound but needs domain knowledge
Scenarios where skill bloat is a concern
When you want a conservative evolution strategy

Adaptive Evolution

Combines intelligent observation filtering with reward tracking.

How it works:

1. Observation filtering: Selects the most informative observations (diverse failures, novel patterns)
2. Reward tracking: Monitors score trends to adjust intervention density
3. Adaptive intervention: When score is improving, makes smaller changes; when plateaued, makes larger changes
4. Multi-objective: Can optimize for multiple metrics simultaneously

Best for:

Fine-grained control over evolution pace
Domains with noisy evaluation signals
When you need to balance exploration vs exploitation

Adaptive Skill

A skill-centric engine that focuses exclusively on building the skill library.

How it works:

1. Skill gap analysis: Identifies task categories where the agent consistently fails
2. Targeted discovery: Creates skills specifically addressing identified gaps
3. Skill refinement: Iteratively improves existing skills based on new observations
4. Library management: Merges overlapping skills, prunes unused ones

Best for:

Domains where procedural knowledge is the primary bottleneck
Building reusable skill libraries across agents
When the system prompt is already well-optimized

Workspace Lifecycle

Creation

Workspaces are created in one of three ways:

1. From seed: Evolver(agent="swe") copies seed_workspaces/swe/ to a working directory
2. From path: Evolver(agent="./my-agent") uses the directory directly
3. From agent: Evolver(agent=MyAgent("./workspace")) uses the agent's workspace
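
The three creation routes can be pictured as a small resolver. The sketch below is an illustration under stated assumptions, not the `Evolver` internals: it assumes a bare name is checked against the seed directory before being treated as a path, and that an agent instance exposes a `workspace_path` attribute. `resolve_workspace`, `seeds_root`, and `work_root` are hypothetical names.

```python
import shutil
from pathlib import Path


def resolve_workspace(agent, seeds_root: Path, work_root: Path) -> Path:
    """Return the workspace directory for a seed name, a path, or an agent instance."""
    if not isinstance(agent, (str, Path)):
        return Path(agent.workspace_path)           # 3. agent instance (attribute assumed)

    name = str(agent)
    is_bare_name = Path(name).name == name and name not in (".", "..")
    if is_bare_name and (seeds_root / name).is_dir():
        target = work_root / name                   # 1. named seed: copy, never edit the seed
        shutil.copytree(seeds_root / name, target, dirs_exist_ok=True)
        return target

    if Path(name).is_dir():
        return Path(name)                           # 2. existing directory: used in place

    raise ValueError(f"{name!r} is neither a known seed nor an existing directory")
```

Copying the seed keeps the packaged workspace pristine, so repeated runs always start from the same baseline.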

During Evolution

Each cycle modifies the workspace:

Files changed: prompts, skills, memory (as configured by evolve_* flags)
Files added: new skills, memory entries, observation batches
Git history: two commits per cycle (pre-evo-N, evo-N)

After Evolution

The workspace contains the optimized agent state:

Evolved system prompt at prompts/system.md
Discovered skills in skills/
Episodic memories in memory/
Full evolution history in evolution/
Complete git history with tagged checkpoints

The workspace is a standalone directory that can be:

Copied and reused for future evolution runs
Deployed as-is (the agent reads from the workspace at runtime)
Version-controlled independently
Shared with other developers

Error Handling and Recovery

Cycle Failure

If a cycle fails mid-execution (LLM error, timeout, etc.):

The pre-evo snapshot has already been committed
The workspace reverts to the pre-evo state
The cycle is marked as failed in the history
Evolution continues with the next cycle

Agent Failure

If the agent fails to solve a task:

The trajectory is recorded with empty output and error details
The benchmark evaluates it as a failure (score 0.0)
The failure observation is still useful for the evolver

Engine Failure

If the evolution engine fails:

The workspace remains at the pre-evo snapshot
The cycle is recorded with mutated=False
Evolution continues (the engine may succeed on the next cycle)
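
The engine-failure behaviour amounts to "snapshot, try, restore on error". A minimal sketch follows, using an in-memory dict and a deep copy in place of the workspace and its pre-evo git snapshot. `run_engine_safely` and the log record shape are assumptions for illustration.

```python
import copy


def run_engine_safely(engine_step, workspace: dict, cycle: int, log: list) -> dict:
    """Run one engine step; on any error, restore the pre-evo state and keep going."""
    pre_evo = copy.deepcopy(workspace)  # stand-in for the pre-evo-N snapshot
    try:
        engine_step(workspace)
        mutated, error = workspace != pre_evo, None
    except Exception as exc:  # broad on purpose: any engine error should not stop the run
        workspace.clear()
        workspace.update(pre_evo)  # revert in place so existing references stay valid
        mutated, error = False, repr(exc)
    log.append({"cycle": cycle, "mutated": mutated, "error": error})
    return workspace
```

Restoring in place matters when the agent holds a reference to the same workspace object; with git, the equivalent is checking out the pre-evo snapshot into the working tree.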

Recovery from Corrupted State

If the workspace is in a bad state, recover using git:

# Reset to last known good state
git checkout evo-5 -- .

# Or reset to the state before any evolution (pre-evo-1 precedes the first mutation)
git checkout pre-evo-1 -- .

A-Evolve Design Patterns

This document describes common patterns for building effective agents and benchmarks with A-Evolve. These patterns are derived from the built-in agents that achieved top-ranking benchmark results.

---

Pattern 1: Verify-Fix Loop

Used by: SWE Agent (76.8% on SWE-bench Verified)
Applicable to: Any domain with verifiable outputs

The agent runs verification after each edit, fixing issues iteratively instead of generating a single output.

Implementation

class VerifyFixAgent(ae.BaseAgent):
    def solve(self, task: ae.Task) -> ae.Trajectory:
        steps = []
        output = ""
        feedback = ""  # Test errors from the previous attempt; empty on the first try

        for attempt in range(self.max_attempts):
            # 1. Generate a solution, using the last attempt and its test errors as context
            solution = self._generate_solution(task, output, feedback)
            steps.append({"action": "generate", "attempt": attempt})

            # 2. Verify
            test_result = self._run_tests(solution)
            steps.append({"action": "verify", "passed": test_result.passed})

            output = solution  # Keep the latest attempt, even if it fails
            if test_result.passed:
                break

            # 3. Carry the failure details into the next iteration
            feedback = f"Tests failed:\n{test_result.errors}\n\nFix the solution."

        return ae.Trajectory(task_id=task.id, output=output, steps=steps)

Why It Works

Tests provide precise, actionable feedback for each attempt
Each fix is informed by specific failure details, not generic retry
Converges faster than single-shot generation
Works with any domain that has automated verification

Evolution Interaction

The evolver can improve this pattern by:

Prompt: Teaching the agent better debugging strategies
Skills: Adding "common fix patterns" for recurring failure types
Memory: Recording which test failures indicate which root causes

---

Pattern 2: Hypothesis-First Exploration

Used by: SWE Agent
Applicable to: Debugging, investigation, analysis tasks

Before exploring the codebase, the agent forms a hypothesis about the root cause and tests it directly.

Implementation

class HypothesisFirstAgent(ae.BaseAgent):
    def solve(self, task: ae.Task) -> ae.Trajectory:
        steps = []

        # 1. Form hypothesis from task description
        hypothesis = self._form_hypothesis(task.input)
        steps.append({"action": "hypothesize", "hypothesis": hypothesis})

        # 2. Design minimal test
        test_plan = self._design_test(hypothesis)
        steps.append({"action": "plan_test", "plan": test_plan})

        # 3. Execute test (targeted exploration)
        evidence = self._execute_test(test_plan)
        steps.append({"action": "test", "evidence": evidence})

        # 4. If hypothesis confirmed, fix directly
        # If refuted, form new hypothesis with new information
        if evidence.supports_hypothesis:
            solution = self._implement_fix(hypothesis, evidence)
        else:
            # Refine and retry
            solution = self._explore_and_fix(task, evidence)

        return ae.Trajectory(task_id=task.id, output=solution, steps=steps)

Why It Works

Reduces exploration time by 60-80% compared to breadth-first search
Focuses the agent's limited context window on the most relevant code
Forms a narrative (hypothesis → evidence → conclusion) that improves reasoning
Failed hypotheses still provide useful information (rules out possibilities)

System Prompt Pattern

Include this in the evolved prompt:

## Approach
1. Read the issue carefully and form a SPECIFIC hypothesis about the root cause
2. Identify the MINIMUM number of files to read to test your hypothesis
3. Read those files and check if your hypothesis is correct
4. If correct, implement the fix. If wrong, form a new hypothesis.

NEVER: Start by listing all files in the repository
NEVER: Read more than 3 files before forming a hypothesis

---

Pattern 3: Skill Injection via System Prompt

Used by: All built-in agents
Applicable to: Any domain

The agent reads evolved skills and injects them into the LLM's system prompt, making skill knowledge available at inference time.

Implementation

class SkillAwareAgent(ae.BaseAgent):
    def solve(self, task: ae.Task) -> ae.Trajectory:
        # 1. Build system prompt with all skills
        system = self.system_prompt

        # 2. Append skill content
        if self.skills:
            skill_sections = []
            for skill_meta in self.skills:
                content = self.get_skill_content(skill_meta.name)
                skill_sections.append(
                    f"### {skill_meta.name}\n"
                    f"*{skill_meta.description}*\n\n"
                    f"{content}"
                )
            system += "\n\n## Learned Skills\n\n" + "\n\n".join(skill_sections)

        # 3. Append relevant memories
        if self.memories:
            memory_text = "\n".join(
                f"- {m['content']}" for m in self.memories[-10:]
            )
            system += f"\n\n## Lessons Learned\n{memory_text}"

        # 4. Call LLM with enriched prompt
        response = self._call_llm(system=system, user=task.input)
        return ae.Trajectory(task_id=task.id, output=response)

Why It Works

Skills provide domain-specific procedures that the base model doesn't have
Memory provides recent lessons that prevent repeated mistakes
The system prompt grows organically with each evolution cycle
Skills have TRIGGER conditions so the LLM knows when to apply them

Skill Filtering (Advanced)

For agents with many skills, filter to relevant ones:

def _get_relevant_skills(self, task: ae.Task) -> list[ae.SkillMeta]:
    """Select skills whose TRIGGER matches the task."""
    relevant = []
    for skill in self.skills:
        # Simple keyword matching
        trigger = skill.description.lower()
        task_text = task.input.lower()
        if any(keyword in task_text for keyword in self._extract_keywords(trigger)):
            relevant.append(skill)
    return relevant or self.skills[:5]  # Fallback to first 5

---

Pattern 4: Concurrent Timeout Enforcement

Used by: Terminal Agent (76.5% on Terminal-Bench 2.0)
Applicable to: Tasks with wall-clock time constraints

Wraps the solve logic in a timeout to prevent hanging on difficult tasks.

Implementation

from concurrent.futures import ThreadPoolExecutor, TimeoutError

class TimedAgent(ae.BaseAgent):
    def __init__(self, workspace_dir, timeout_seconds=300):
        super().__init__(workspace_dir)
        self.timeout = timeout_seconds

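    def solve(self, task: ae.Task) -> ae.Trajectory:
        # No "with" block: its exit waits for the worker and would defeat the timeout
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(self._solve_inner, task)
        try:
            return future.result(timeout=self.timeout)
        except TimeoutError:
            return ae.Trajectory(
                task_id=task.id,
                output="TIMEOUT: Task exceeded time limit",
                steps=[{"action": "timeout", "limit": self.timeout}],
            )
        finally:
            # Return without waiting; the worker thread cannot be killed and may keep running
            pool.shutdown(wait=False)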

    def _solve_inner(self, task: ae.Task) -> ae.Trajectory:
        # Actual solving logic (may take a long time)
        ...

Why It Works

Prevents a single hard task from blocking the entire evolution cycle
Returns a failed trajectory instead of hanging (evolver can learn from timeout pattern)
Keeps cycle time predictable and bounded

---

Pattern 5: Progressive Prompt Refinement

Evolved pattern: The evolver discovers this organically during evolution

Rather than rewriting the prompt from scratch, the evolver makes incremental additions:

Cycle 1: Base prompt (as written by human)

You are an expert software engineer.

Cycle 3: Add approach section

You are an expert software engineer.

## Approach
1. Form a hypothesis about the root cause
2. Verify with minimal exploration
3. Implement a targeted fix

Cycle 5: Add common mistakes to avoid

You are an expert software engineer.

## Approach
1. Form a hypothesis about the root cause
2. Verify with minimal exploration
3. Implement a targeted fix

## Common Mistakes to Avoid
- Don't modify test files
- Always run the full test suite, not just the failing test
- Check for import side effects before editing __init__.py

Cycle 8: Consolidate and refactor

You are an expert software engineer who fixes bugs systematically.

## Method
1. HYPOTHESIZE: Read the issue and predict the root cause before exploring code
2. VERIFY: Read ≤3 files to confirm. If wrong, re-hypothesize with new information
3. FIX: Make the minimal change that addresses the root cause
4. TEST: Run the full test suite. If tests fail, read the error and iterate

## Rules
- Never modify test files
- Never read more than 5 files before attempting a fix
- Always check import side effects in __init__.py files

Why It Works

Each cycle adds knowledge from observed failures
The evolver can see which rules helped (via score improvements)
Consolidation prevents prompt bloat
The prompt becomes a distilled version of "what works"

---

Pattern 6: Observation-Enriched Feedback

Key insight: The quality of evolution depends heavily on the quality of feedback.

Poor Feedback (limits evolution)

def evaluate(self, task, trajectory):
    passed = all(t.passed for t in run_tests(trajectory.output))
    return ae.Feedback(success=passed, score=1.0 if passed else 0.0, detail="")

Rich Feedback (enables targeted evolution)

def evaluate(self, task, trajectory):
    test_results = run_tests(trajectory.output)
    failures = [t for t in test_results if not t.passed]
    
    detail_parts = []
    if failures:
        for f in failures[:3]:  # Top 3 failures
            detail_parts.append(f"FAIL {f.test_name}: {f.error_type} — {f.message[:100]}")
    
    detail_parts.append(f"Passed {len(test_results) - len(failures)}/{len(test_results)} tests")
    
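    if trajectory.output:
        # Count lines outside the f-string: backslashes in f-string expressions fail before Python 3.12
        line_count = len(trajectory.output.splitlines())
        detail_parts.append(f"Output: {len(trajectory.output)} chars, {line_count} lines")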
    
    score = (len(test_results) - len(failures)) / max(len(test_results), 1)
    
    return ae.Feedback(
        success=len(failures) == 0,
        score=score,
        detail="; ".join(detail_parts),
        raw={"test_results": [t.to_dict() for t in test_results]},
    )

Why It Works

The evolver reads feedback.detail to understand why the agent failed
Specific error messages help the evolver create targeted skills
Partial scores (0.7 instead of 0.0) show progress even when not fully passing
raw data enables the evolver to do deeper analysis if needed

---

Pattern 7: Multi-Model Agent Architecture

Advanced pattern: Use different models for different tasks within the same agent.

Implementation

class MultiModelAgent(ae.BaseAgent):
    def __init__(self, workspace_dir):
        super().__init__(workspace_dir)
        self.planning_model = "claude-opus-4-6-20250514"      # Strong reasoning
        self.execution_model = "claude-sonnet-4-20250514"      # Fast execution
        self.review_model = "claude-haiku-4-5-20251001"        # Quick validation

    def solve(self, task: ae.Task) -> ae.Trajectory:
        steps = []

        # 1. Plan with strong model
        plan = self._call(self.planning_model, 
            f"Analyze this task and create a plan:\n{task.input}")
        steps.append({"phase": "plan", "model": self.planning_model})

        # 2. Execute with fast model
        solution = self._call(self.execution_model,
            f"Execute this plan:\n{plan}\n\nTask:\n{task.input}")
        steps.append({"phase": "execute", "model": self.execution_model})

        # 3. Review with lightweight model
        review = self._call(self.review_model,
            f"Check this solution for obvious errors:\n{solution}")
        steps.append({"phase": "review", "model": self.review_model})

        if "error" in review.lower():
            # Fix errors with strong model
            solution = self._call(self.planning_model,
                f"Fix these issues:\n{review}\n\nSolution:\n{solution}")
            steps.append({"phase": "fix", "model": self.planning_model})

        return ae.Trajectory(task_id=task.id, output=solution, steps=steps)

Cost Optimization

Phase	Model	Cost	Reasoning Quality
Planning	Opus	High	Maximum
Execution	Sonnet	Medium	Good
Review	Haiku	Low	Sufficient
Fix (if needed)	Opus	High	Maximum

Typical cost reduction: 40-60% vs using Opus for everything.

---

Pattern 8: Workspace Partitioning for Multi-Stage Evolution

Run different evolution stages on different workspace layers.

Stage 1: Prompt evolution only

config_stage1 = ae.EvolveConfig(
    evolve_prompts=True,
    evolve_skills=False,
    evolve_memory=False,
    max_cycles=10,
)

Stage 2: Skill discovery (prompt locked)

config_stage2 = ae.EvolveConfig(
    evolve_prompts=False,
    evolve_skills=True,
    evolve_memory=True,
    max_cycles=15,
)

Stage 3: Joint refinement

config_stage3 = ae.EvolveConfig(
    evolve_prompts=True,
    evolve_skills=True,
    evolve_memory=True,
    max_cycles=10,
    egl_threshold=0.01,  # Fine-grained convergence
)

Why It Works

Prompt optimization first establishes a strong foundation
Skills built on a good prompt are more focused
Joint refinement catches interactions between layers
Total cost may be lower than single-stage evolution

---

Anti-Patterns

Anti-Pattern 1: Unbounded Prompt Growth

Problem: Evolver keeps appending rules without consolidating.
Symptom: Prompt grows to 15K+ chars, agent performance degrades.
Fix: Periodically run a consolidation-focused cycle, or set max prompt length in config.

Anti-Pattern 2: Skill Library Bloat

Problem: Every failure gets its own skill.
Symptom: 30+ narrow skills like "handle-empty-list" and "check-null-return".
Fix: Use the default SkillForge engine, which merges overlapping skills. Target 5-10 broad skills.

Anti-Pattern 3: Memory Without Curation

Problem: Every observation generates a memory entry.
Symptom: Hundreds of entries, many contradictory or outdated.
Fix: Only remember() lessons that are genuinely reusable. Let the evolver curate and consolidate.

Anti-Pattern 4: Overfitting to Training Tasks

Problem: Agent scores 95% on training but 60% on holdout.
Symptom: Skills are too specific to training task patterns.
Fix: Use holdout_ratio=0.2 to maintain a validation set. Ensure training tasks are diverse.
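
To check for this gap by hand, split the task pool once and compare scores on the two parts. The sketch below shows a seeded split and a simple gap check; how the library applies `holdout_ratio` internally is not described here, so `split_tasks`, `overfit_gap`, and the 0.10 tolerance are illustrative assumptions.

```python
import random


def split_tasks(tasks, holdout_ratio=0.2, seed=0):
    """Shuffle deterministically and return (train, holdout)."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_ratio)
    if len(shuffled) > 1:
        n_holdout = max(1, n_holdout)  # keep at least one holdout task when possible
    return shuffled[n_holdout:], shuffled[:n_holdout]


def overfit_gap(train_score, holdout_score, tolerance=0.10):
    """Return (gap, flagged); flagged is True when training beats holdout by more than tolerance."""
    gap = train_score - holdout_score
    return gap, gap > tolerance
```

With the figures from the problem statement, `overfit_gap(0.95, 0.60)` reports a gap of about 0.35 and flags it.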

Anti-Pattern 5: Ignoring Convergence

Problem: Running 50 cycles when score plateaued at cycle 10.
Symptom: Wasted compute, no improvement in last 40 cycles.
Fix: Set appropriate egl_threshold and egl_window. Check results.converged flag.

A-Evolve Real-World Examples

Example 1: Evolve a SWE-Bench Agent

The most common use case — optimize an agent that solves GitHub issues.

Minimal Run

import agent_evolve as ae

evolver = ae.Evolver(agent="swe", benchmark="swe-verified")
results = evolver.run(cycles=10)
print(f"Score: {results.final_score:.1%}")

Full Configuration

import agent_evolve as ae

config = ae.EvolveConfig(
    batch_size=15,
    max_cycles=30,
    evolve_prompts=True,
    evolve_skills=True,
    evolve_memory=True,
    evolver_model="us.anthropic.claude-opus-4-6-v1",
    egl_threshold=0.03,    # Tighter convergence
    egl_window=5,          # Longer patience
)

evolver = ae.Evolver(
    agent="swe",
    benchmark="swe-verified",
    config=config,
)
results = evolver.run()

# Inspect evolution trajectory
for i, score in enumerate(results.score_history):
    print(f"Cycle {i + 1}: {score:.3f}")

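Expected Output

Illustrative values. The loop above prints only the cycle number and score; the annotations summarize what changed in each cycle.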

Cycle 1: 0.620 — Established baseline, no mutations
Cycle 2: 0.640 — Added verify-before-submit skill
Cycle 3: 0.680 — Refined system prompt to prioritize test discovery
Cycle 4: 0.720 — Added edge-case-testing skill, merged with verify
Cycle 5: 0.730 — Memory: common Django test patterns
Cycle 6: 0.740 — Prompt: explicit hypothesis-first workflow
Cycle 7: 0.740 — No improvement
Cycle 8: 0.745 — Minor skill refinement
Cycle 9: 0.750 — Converged (< 0.03 improvement over 5 cycles)
Final score: 0.750

---

Example 2: Batch Solve Without Evolution

Run the agent across many tasks in parallel without evolving — useful for benchmarking a snapshot.

import agent_evolve as ae
from concurrent.futures import ThreadPoolExecutor, as_completed

# Load agent and benchmark
evolver = ae.Evolver(agent="swe", benchmark="swe-verified")
agent = evolver._agent
benchmark = evolver._benchmark

# Get up to 50 tasks from the test split
tasks = benchmark.get_tasks(split="test", limit=50)

results = []
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(agent.solve, task): task for task in tasks}
    for future in as_completed(futures):
        task = futures[future]
        trajectory = future.result()
        feedback = benchmark.evaluate(task, trajectory)
        results.append((task.id, feedback.score, feedback.success))
        print(f"{task.id}: {'✓' if feedback.success else '✗'} ({feedback.score:.2f})")

# Summary
passed = sum(1 for _, _, s in results if s)
print(f"\nTotal: {passed}/{len(results)} ({passed/len(results):.1%})")

---

Example 3: Sequential Evolution with Feedback Modes

Compare evolution with and without score visibility:

import agent_evolve as ae

# Mode 1: Evolver sees full feedback (scores + details)
config_full = ae.EvolveConfig(
    batch_size=10,
    max_cycles=10,
    trajectory_only=False,
)
evolver_full = ae.Evolver(agent="swe", benchmark="swe-verified", config=config_full)
results_full = evolver_full.run()

# Mode 2: Evolver only sees trajectories (must infer quality)
config_blind = ae.EvolveConfig(
    batch_size=10,
    max_cycles=10,
    trajectory_only=True,
)
evolver_blind = ae.Evolver(agent="swe", benchmark="swe-verified", config=config_blind)
results_blind = evolver_blind.run()

print(f"Full feedback: {results_full.final_score:.1%}")
print(f"Blind mode:    {results_blind.final_score:.1%}")

---

Example 4: Custom Agent for Code Review

Build an agent that reviews pull requests and evolve it:

import agent_evolve as ae
import anthropic

class CodeReviewAgent(ae.BaseAgent):
    def __init__(self, workspace_path: str):
        super().__init__(workspace_path)
        self.client = anthropic.Anthropic()

    def solve(self, task: ae.Task) -> ae.Trajectory:
        # Build prompt with evolved system prompt and skills
        messages = [
            {"role": "user", "content": f"Review this diff:\n\n{task.input}"}
        ]

        # Inject skills into system prompt
        skill_text = "\n".join(
            f"## {s.name}\n{self.get_skill_content(s.name)}"
            for s in self.skills
        )
        system = f"{self.system_prompt}\n\n# Available Skills\n{skill_text}"

        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system,
            messages=messages,
        )
        output = response.content[0].text

        return ae.Trajectory(
            task_id=task.id,
            output=output,
            steps=[{"tool": "llm", "action": "review", "tokens": response.usage.output_tokens}],
        )


class CodeReviewBenchmark(ae.BenchmarkAdapter):
    def __init__(self, dataset_path: str):
        self.dataset_path = dataset_path

    def get_tasks(self, split="train", limit=None):
        import json
        with open(f"{self.dataset_path}/{split}.jsonl") as f:
            items = [json.loads(line) for line in f]
        if limit:
            items = items[:limit]
        return [
            ae.Task(
                id=item["id"],
                input=item["diff"],
                metadata={"expected_comments": item["comments"]},
            )
            for item in items
        ]

    def evaluate(self, task, trajectory):
        expected = set(task.metadata["expected_comments"])
        actual = set(extract_comments(trajectory.output))  # extract_comments: user-supplied parser
        tp = len(expected & actual)
        precision = tp / (len(actual) + 1e-9)
        recall = tp / (len(expected) + 1e-9)
        f1 = 2 * precision * recall / (precision + recall + 1e-9)
        return ae.Feedback(
            success=f1 > 0.6,
            score=f1,
            detail=f"Found {tp}/{len(expected)} issues (P={precision:.2f} R={recall:.2f})",
        )


# Set up workspace
# mkdir -p my-reviewer/prompts my-reviewer/skills my-reviewer/memory
# Write manifest.yaml and prompts/system.md

evolver = ae.Evolver(
    agent=CodeReviewAgent("./my-reviewer"),
    benchmark=CodeReviewBenchmark("./review-data"),
    config=ae.EvolveConfig(batch_size=5, max_cycles=15),
)
results = evolver.run()
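
The benchmark above relies on an `extract_comments` helper that the example does not define. The following is a minimal sketch of what such a helper might look like, assuming the reviewer emits one issue per bullet or numbered line and that matching is done on normalized text. Both `extract_comments` and `f1_score` are hypothetical helpers written for illustration; they are not part of `agent_evolve`.

```python
import re

# Matches "- text", "* text", "1. text" or "1) text" (assumed review format).
_BULLET = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+(.*\S)")


def extract_comments(review_text: str) -> list[str]:
    """Hypothetical helper: return normalized bullet lines from a review."""
    comments = []
    for line in review_text.splitlines():
        match = _BULLET.match(line)
        if match:
            # Lower-case and collapse whitespace so set comparison is stable.
            comments.append(" ".join(match.group(1).lower().split()))
    return comments


def f1_score(expected: set[str], actual: set[str]) -> float:
    """F1 over exact matches; returns 0.0 when nothing matches."""
    tp = len(expected & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(actual)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)
```

Exact string matching is strict; in practice the expected comments would need to be normalized the same way, or compared with a fuzzier similarity measure.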

---

Example 5: Custom Evolution Engine

A rule-based engine that appends learned patterns to the system prompt:

import agent_evolve as ae
import re
from collections import Counter

class PatternLearningEngine(ae.EvolutionEngine):
    def step(self, workspace, observations, history, trial):
        failures = [o for o in observations if not o.feedback.success]
        if not failures:
            return ae.StepResult(mutated=False, summary="All passed, no mutations needed")

        # Categorize failure patterns
        patterns = Counter()
        for obs in failures:
            detail = obs.feedback.detail.lower()
            if "timeout" in detail:
                patterns["timeout"] += 1
            elif "assertion" in detail or "test" in detail:
                patterns["test_failure"] += 1
            elif "syntax" in detail or "parse" in detail:
                patterns["syntax_error"] += 1
            else:
                patterns["unknown"] += 1

        # Generate rules for top patterns
        rules = []
        if patterns["timeout"] > 0:
            rules.append("- Before submitting, verify the solution completes within time limits")
        if patterns["test_failure"] > 1:
            rules.append("- Run ALL related tests, not just the failing one")
        if patterns["syntax_error"] > 0:
            rules.append("- Validate syntax after every edit")

        if not rules:
            return ae.StepResult(mutated=False, summary="No actionable patterns found")

        # Append rules to prompt
        prompt = workspace.read_prompt()
        rule_block = "\n\n## Learned Rules (Auto-Generated)\n" + "\n".join(rules)
        workspace.write_prompt(prompt + rule_block)

        return ae.StepResult(
            mutated=True,
            summary=f"Added {len(rules)} rules from {len(failures)} failures",
            metadata={"patterns": dict(patterns), "rules": rules},
        )

# Use the custom engine
evolver = ae.Evolver(
    agent="swe",
    benchmark="swe-verified",
    engine=PatternLearningEngine(),
)
results = evolver.run(cycles=10)

---

Example 6: Inspecting Evolution History

After an evolution run, analyze what happened:

import agent_evolve as ae

evolver = ae.Evolver(agent="./evolved-swe", benchmark="swe-verified")
results = evolver.run(cycles=5)

# Access workspace for post-mortem
workspace = evolver._workspace

# Read the evolved system prompt
final_prompt = workspace.read_prompt()
print(f"Final prompt length: {len(final_prompt)} chars")

# List discovered skills
for skill in workspace.list_skills():
    print(f"  Skill: {skill.name} — {skill.description}")

# Read evolution history
history = evolver._history
scores = history.get_score_curve()
for cycle, score in scores:
    print(f"  Cycle {cycle}: {score:.3f}")

# Compare workspace at different points
diff = history.get_workspace_diff("evo-1", "evo-5")
print(f"\nChanges from cycle 1 to 5:\n{diff}")

# Read prompt as it was at cycle 3
old_prompt = history.read_file_at("evo-3", "prompts/system.md")

---

Example 7: Workspace Setup from Scratch

Create a new agent workspace manually:

mkdir -p my-agent/{prompts,skills,memory,tools}

# manifest.yaml
cat > my-agent/manifest.yaml << 'EOF'
agent:
  type: reference
  entrypoint: my_module.agent.MyAgent
evolvable_layers:
  - prompts
  - skills
  - memory
reload_strategy: hot
EOF

# System prompt
cat > my-agent/prompts/system.md << 'EOF'
You are an expert assistant. Analyze the given task carefully, break it into steps, and produce a high-quality solution.

## Approach
1. Understand the task requirements
2. Plan your approach
3. Execute step by step
4. Verify your solution
EOF

# Initialize git for version tracking
cd my-agent && git init && git add -A && git commit -m "Initial workspace"

Then point the evolver at it:

evolver = ae.Evolver(agent="./my-agent", benchmark=MyBenchmark())

A-Evolve: Common Issues & Solutions

Issue 1: `ModuleNotFoundError: No module named 'agent_evolve'`

Context: Running evolution script after pip install.

Solution: Ensure you installed the package correctly:

# From source
pip install -e .

# From PyPI
pip install a-evolve

# With provider support
pip install "a-evolve[anthropic]"    # For Claude
pip install "a-evolve[bedrock]"      # For AWS Bedrock
pip install "a-evolve[all]"          # Everything (quotes stop zsh from globbing the brackets)

If using a virtual environment, verify activation:

which python   # Should point to your venv
python -c "import agent_evolve; print(agent_evolve.__file__)"

---

Issue 2: Evolution Score Stays Flat After Multiple Cycles

Symptoms: Score doesn't improve beyond cycle 1-2 baseline.

Root causes and fixes:

1. Batch too small: With batch_size=3, the evolver sees too few observations to identify patterns. Increase to 10-15.

2. Benchmark tasks too similar: If all tasks test the same skill, there's no diversity signal. Ensure get_tasks() returns varied difficulties.

3. Evolver can't see scores: If trajectory_only=True, the evolver must infer quality from trajectories alone. Set trajectory_only=False for faster learning.

4. Skills not loaded by agent: Verify that reload_from_fs() actually re-reads skills and injects them into the LLM prompt. Common mistake: loading skills at __init__ but not reloading them.

# Debug: print what the agent sees after each cycle
class MyAgent(ae.BaseAgent):
    def reload_from_fs(self):
        super().reload_from_fs()
        print(f"Reloaded {len(self.skills)} skills")
        print(f"Prompt length: {len(self.system_prompt)} chars")

---

Issue 3: `FileNotFoundError: manifest.yaml not found`

Context: Passing a workspace path to Evolver.

Solution: Every workspace must have a manifest.yaml at the root:

agent:
  type: reference
  entrypoint: my_module.MyAgent
evolvable_layers:
  - prompts
  - skills
reload_strategy: hot

Verify the file exists:

ls -la ./my-workspace/manifest.yaml

---

Issue 4: Git Errors During Evolution Snapshots

Symptoms: fatal: not a git repository or merge conflicts.

Root causes:

1. Workspace not a git repo: Initialize before running evolution:

cd my-workspace && git init && git add -A && git commit -m "Initial workspace"

2. Dirty working tree: Uncommitted changes from a previous run. Reset or commit:

cd my-workspace && git add -A && git commit -m "Clean state"

3. Concurrent evolution on same workspace: Each evolver.run() must operate on its own workspace copy. Use the built-in seed copy mechanism:

# This auto-copies the seed to a fresh working directory
evolver = ae.Evolver(agent="swe", benchmark="swe-verified")

---

Issue 5: AWS Bedrock Authentication Failures

Symptoms: botocore.exceptions.NoCredentialsError when using Bedrock models.

Solution:

# Option 1: Environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-west-2

# Option 2: AWS CLI profile
aws configure

# Option 3: IAM role (on EC2/ECS)
# Ensure instance role has bedrock:InvokeModel permission

Verify access:

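import boto3

# Creating a client does not check credentials, so make a real call instead.
identity = boto3.client("sts", region_name="us-west-2").get_caller_identity()
print(identity["Arn"])  # Raises NoCredentialsError if no credentials are found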

---

Issue 6: Anthropic Rate Limits During Evolution

Symptoms: RateLimitError or 429 responses mid-evolution.

Solution: The evolver makes LLM calls to mutate the workspace, in addition to agent solve calls. For high batch sizes, this can exceed rate limits.

Mitigation:

Reduce batch_size (fewer concurrent solve calls)
Add retry logic in your agent's solve() method
Use Bedrock instead of the direct Anthropic API if your account's quotas there are higher
Stagger evolution cycles with short pauses between them
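
The retry suggestion above can be sketched as a small wrapper with exponential backoff and jitter. This is a generic illustration, not something shipped with `agent_evolve`: `call_with_backoff` and its parameters are names chosen here, and which exception to retry on depends on your provider SDK.

```python
import random
import time


def call_with_backoff(fn, *, retry_on=(Exception,), max_attempts=5,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 10% jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
```

Inside `solve()`, the LLM call would be wrapped in a lambda and passed as `fn`, with `retry_on` set to the rate-limit exception of the SDK in use (for the Anthropic SDK that would be `anthropic.RateLimitError`).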

---

Issue 7: Skills Not Being Discovered

Symptoms: After 10+ cycles, skills/ directory remains empty.

Root causes:

1. `evolve_skills=False` in config. Enable it:

config = ae.EvolveConfig(evolve_skills=True)

2. Engine doesn't support skill creation: The default AEvolveEngine does. Custom engines must explicitly write to workspace.write_skill().

3. Evolver lacks sufficient context: Ensure observations include detailed failure feedback, not just pass/fail booleans. Richer feedback.detail strings help the evolver identify skill-worthy patterns.

---

Issue 8: Agent Doesn't Pick Up Evolved Prompts

Symptoms: Agent behavior doesn't change between cycles despite prompt mutations.

Root cause: Agent caches the system prompt at initialization and doesn't re-read.

Fix: Implement reload_from_fs() properly:

class MyAgent(ae.BaseAgent):
    def __init__(self, workspace_path):
        super().__init__(workspace_path)
        self._load_state()

    def _load_state(self):
        self._cached_prompt = self.system_prompt
        self._cached_skills = [
            self.get_skill_content(s.name) for s in self.skills
        ]

    def reload_from_fs(self):
        super().reload_from_fs()  # Re-reads files from disk
        self._load_state()        # Update cached state

---

Issue 9: `EvolutionResult.converged=True` Too Early

Symptoms: Evolution stops after 3-4 cycles even though score is low.

Cause: Default convergence settings are too aggressive for slow-improving domains.

Fix: Increase the convergence window and decrease threshold:

config = ae.EvolveConfig(
    egl_threshold=0.01,   # Require < 1% improvement to converge (default 5%)
    egl_window=5,          # Look at 5 cycles instead of 3
    max_cycles=50,         # Allow more cycles
)
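
To build intuition for how these two settings interact, here is a toy convergence check. It is only an assumption about the kind of rule involved (stop when the best score in the last `window` cycles beats the score just before that window by less than `threshold`); the actual rule used by the library is not described here, and `has_converged` is a hypothetical name.

```python
def has_converged(score_history, window=3, threshold=0.05):
    """Toy rule: True if the last `window` cycles gained less than `threshold`."""
    if len(score_history) <= window:
        return False  # Not enough cycles to compare against a baseline
    baseline = score_history[-window - 1]
    best_recent = max(score_history[-window:])
    return (best_recent - baseline) < threshold


slow_run = [0.30, 0.31, 0.31, 0.32]
print(has_converged(slow_run, window=3, threshold=0.05))  # Stops early
print(has_converged(slow_run, window=5, threshold=0.01))  # Keeps going
```

Under this reading, a larger window delays the first possible stop, and a smaller threshold lets slow but steady gains count as progress.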

---

Issue 10: Memory Overflow with Large Trajectories

Symptoms: Python OOM when processing benchmarks with very long agent conversations.

Root cause: Full conversation history stored in Trajectory.conversation for every task.

Mitigation:

Truncate conversations in your agent's solve() before returning
Store only the final output and key tool calls in steps
Use smaller batch sizes to limit concurrent memory usage

def solve(self, task):
    # ... run agent ...
    return ae.Trajectory(
        task_id=task.id,
        output=final_answer,
        steps=key_steps_only,        # Not full conversation
        conversation=[],              # Skip if not needed for evolution
    )
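
One way to produce the `key_steps_only` value used above is a small compaction helper. This sketch assumes each step is a plain dict, as in the earlier examples; `compact_steps` and its limits are illustrative choices, not library functions.

```python
def compact_steps(steps, keep_last=10, max_chars=500):
    """Keep only the last few steps and clip long string fields."""
    if keep_last <= 0:
        return []
    compacted = []
    for step in steps[-keep_last:]:
        trimmed = {}
        for key, value in step.items():
            if isinstance(value, str) and len(value) > max_chars:
                value = value[:max_chars] + "...[truncated]"
            trimmed[key] = value
        compacted.append(trimmed)  # New dicts: the original steps are untouched
    return compacted
```

Calling `compact_steps(all_steps)` just before building the `Trajectory` bounds memory per task by roughly `keep_last * max_chars` characters per string field.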

---

Issue 11: Workspace Too Large After Many Cycles

Symptoms: .git directory grows to several GB after 20+ cycles.

Cause: Git stores full snapshots of observation JSONL files (which can be large).

Mitigation:

# Delete observation batches last modified more than 7 days ago
cd my-workspace
find evolution/observations/ -name "batch_*.jsonl" -mtime +7 -delete
git add -A && git commit -m "Prune old observations"

# Alternatively, repack the repository (earlier commits still hold the deleted files)
git gc --aggressive

Or configure the evolver to not track observations in git:

# In manifest.yaml
evolution:
  track_observations: false

---

Issue 12: Custom Benchmark Returns Inconsistent Scores

Symptoms: Evolution oscillates — score goes up then down between cycles.

Root cause: Non-deterministic evaluation or tasks sampled differently each cycle.

Fix:

Use a fixed random seed in get_tasks() for reproducible task selection
Ensure evaluate() is deterministic (no randomness in scoring)
Use holdout_ratio to keep a consistent test set:

config = ae.EvolveConfig(holdout_ratio=0.2)  # 20% held out for validation

---

Issue 13: Evolution Produces Overly Long System Prompts

Symptoms: System prompt grows to 10K+ characters after many cycles. Agent performance may degrade due to instruction overload.

Root cause: The default SkillForge engine sometimes appends rules without consolidating existing ones.

Fix:

1. Manual pruning: After evolution, review the prompt and remove redundant sections:

cd my-workspace
wc -c prompts/system.md    # Check size
git diff evo-1 evo-N -- prompts/system.md  # See what was added

2. Run a consolidation cycle: Use the evolver to refactor:

# Create a config that focuses on prompt refinement
config = ae.EvolveConfig(
    batch_size=10,
    max_cycles=3,
    evolve_prompts=True,
    evolve_skills=False,
    evolve_memory=False,
    extra={"consolidate_prompt": True},
)

3. Use fragments instead of one large prompt: Split the prompt into modular fragments that the evolver can manage independently:

prompts/
├── system.md           # Core identity (keep short)
└── fragments/
    ├── reasoning.md    # Reasoning approach
    ├── output.md       # Output formatting
    └── domain.md       # Domain-specific rules
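
With a layout like the one above, the agent needs some way to stitch the fragments back into one prompt. How the built-in agents do this is not specified here, so the following is a sketch under assumptions: fragments are concatenated in filename order after `system.md`, and `assemble_prompt` is a hypothetical helper with an optional size guard.

```python
from pathlib import Path


def assemble_prompt(workspace_dir, max_chars=10_000):
    """Join prompts/system.md with prompts/fragments/*.md in filename order."""
    prompts = Path(workspace_dir) / "prompts"
    parts = [(prompts / "system.md").read_text(encoding="utf-8").strip()]
    fragments_dir = prompts / "fragments"
    if fragments_dir.is_dir():
        for path in sorted(fragments_dir.glob("*.md")):
            parts.append(path.read_text(encoding="utf-8").strip())
    prompt = "\n\n".join(part for part in parts if part)
    if len(prompt) > max_chars:
        # Fail loudly instead of silently sending an overloaded prompt
        raise ValueError(f"Prompt is {len(prompt)} chars, limit is {max_chars}")
    return prompt
```

The size guard turns prompt bloat into an explicit error that can trigger a consolidation pass, rather than a gradual drop in agent quality.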

---

Issue 14: Skill Proposals Never Get Accepted

Symptoms: Agent proposes skills via _drafts/ directory, but the evolver never promotes them to skills/.

Root cause: The SkillForge engine may not be configured to read drafts, or the proposals are too narrow.

Fix:

1. Enable solver-proposed skills in config:

config = ae.EvolveConfig(
    extra={"solver_proposed": True}
)

2. Improve proposal quality in your agent:

def solve(self, task):
    # ... solve the task ...

    # Propose a skill if you learned something reusable
    if learned_pattern:
        draft_content = f"""---
name: {pattern_name}
description: "TRIGGER when: {trigger}. DO NOT TRIGGER: {exclusion}."
---

{pattern_description}

## Steps
{steps}
"""
        # Write to drafts directory
        workspace = ae.AgentWorkspace(self._workspace_dir)
        workspace.write_draft(pattern_name, draft_content)

3. Use the GuidedSynthesisEngine which prioritizes skill curation:

from agent_evolve.algorithms.guided_synth import GuidedSynthesisEngine
evolver = ae.Evolver(agent="./my-agent", benchmark=bm, engine=GuidedSynthesisEngine(config))

---

Issue 15: Different Results on Each Evolution Run

Symptoms: Running the same config on the same seed produces different final scores.

Root cause: LLM-driven evolution is inherently non-deterministic. The evolver model, agent model, and benchmark task sampling all introduce randomness.

Mitigation:

1. Fix task ordering with a seed:

import random

class MyBenchmark(ae.BenchmarkAdapter):
    def get_tasks(self, split="train", limit=10):
        tasks = load_all_tasks(split)       # load_all_tasks: your own loader
        random.Random(42).shuffle(tasks)    # Fixed seed, no global RNG side effects
        return tasks[:limit]

2. Run multiple evolution trials and compare:

import statistics

scores = []
for trial in range(5):
    evolver = ae.Evolver(agent="swe", benchmark="swe-verified")
    result = evolver.run(cycles=10)
    scores.append(result.final_score)

print(f"Mean: {statistics.mean(scores):.3f}")
print(f"Std:  {statistics.pstdev(scores):.3f}")  # Population standard deviation

3. Use temperature=0 in your agent's LLM calls for deterministic behavior (note: evolution engine calls remain stochastic).

---

Issue 16: Workspace Manifest Validation Errors

Symptoms: ValueError: Missing required field 'entrypoint' in manifest.yaml

Root cause: Manifest format doesn't match expected schema.

Fix: Ensure manifest has all required fields:

# Required format
agent:
  type: reference                              # Must be "reference"
  entrypoint: my_module.my_agent.MyAgentClass  # Dotted Python path

evolvable_layers:                              # At least one layer
  - prompts
  - skills
  - memory

reload_strategy: hot                           # "hot" or "cold"

Common mistakes:

Missing agent.type field (must be "reference")
entrypoint is a file path instead of a Python dotted path
evolvable_layers is empty or missing
YAML indentation errors (use 2 spaces, not tabs)

Validate your manifest:

import yaml
with open("manifest.yaml") as f:
    manifest = yaml.safe_load(f)
assert "agent" in manifest
assert "entrypoint" in manifest["agent"]
assert "evolvable_layers" in manifest
print("Manifest OK")
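
The assertions above only check that keys exist. A slightly stricter check that covers each of the common mistakes listed earlier might look like the sketch below. It operates on the dict returned by `yaml.safe_load`; `validate_manifest` is a hypothetical helper, and treating `reload_strategy` as optional is an assumption.

```python
import re

# A dotted Python path such as "my_module.my_agent.MyAgentClass"
DOTTED_PATH = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)+$")


def validate_manifest(manifest) -> list[str]:
    """Return a list of human-readable problems; empty means the manifest looks fine."""
    if not isinstance(manifest, dict):
        return ["manifest must be a YAML mapping"]
    errors = []
    agent = manifest.get("agent")
    if not isinstance(agent, dict):
        errors.append("missing 'agent' section")
        agent = {}
    if agent.get("type") != "reference":
        errors.append("agent.type must be 'reference'")
    entrypoint = agent.get("entrypoint")
    if not isinstance(entrypoint, str) or not DOTTED_PATH.match(entrypoint):
        errors.append("agent.entrypoint must be a dotted Python path, not a file path")
    layers = manifest.get("evolvable_layers")
    if not isinstance(layers, list) or not layers:
        errors.append("evolvable_layers must list at least one layer")
    strategy = manifest.get("reload_strategy")
    if strategy is not None and strategy not in ("hot", "cold"):
        errors.append("reload_strategy must be 'hot' or 'cold'")
    return errors
```

Returning every problem at once, rather than stopping at the first failed assertion, makes it quicker to fix a hand-written manifest in a single pass.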

---

Issue 17: Agent Cannot Import Custom Modules

Symptoms: ModuleNotFoundError when the evolver tries to instantiate the agent from manifest.yaml entrypoint.

Root cause: The custom agent module is not on the Python path.

Fix:

1. Install your agent as a package:

pip install -e .   # If your project has a pyproject.toml

2. Or add the directory to PYTHONPATH:

export PYTHONPATH="${PYTHONPATH}:/path/to/my/agent"

3. Or use an absolute import path in the manifest:

agent:
  entrypoint: my_package.agents.custom.CustomAgent

Verify the import works:

import importlib
module_path, class_name = "my_package.agents.custom.CustomAgent".rsplit(".", 1)
mod = importlib.import_module(module_path)
cls = getattr(mod, class_name)
print(f"Found: {cls}")

---

Issue 18: Evolution Takes Too Long Per Cycle

Symptoms: Each evolution cycle takes 30+ minutes.

Root causes and fixes:

1. Large batch_size: Each task requires a full agent solve. Reduce:

config = ae.EvolveConfig(batch_size=5)  # Fewer tasks per cycle

2. Agent is slow per task: Profile your solve() method:

import time

class MyAgent(ae.BaseAgent):
    def solve(self, task):
        start = time.time()
        result = self._actual_solve(task)
        elapsed = time.time() - start
        print(f"Task {task.id}: {elapsed:.1f}s")
        return result

3. Evolver model is too large: Try a smaller model:

config = ae.EvolveConfig(
    evolver_model="us.anthropic.claude-sonnet-4-6-v1",  # Faster evolver
)

4. Observations too large: Truncate trajectories before observation:

def solve(self, task):
    # ... solve ...
    return ae.Trajectory(
        task_id=task.id,
        output=result,
        steps=steps[-10:],       # Only last 10 steps
        conversation=[],          # Skip full conversation
    )

---

Issue 19: Skills Conflicting with System Prompt

Symptoms: Agent behavior degrades after skill discovery because skills contradict the base prompt.

Root cause: The evolver created skills with instructions that conflict with the system prompt's approach.

Fix:

1. Review and remove conflicting skills:

workspace = ae.AgentWorkspace("./my-agent")
for skill in workspace.list_skills():
    content = workspace.read_skill(skill.name)
    print(f"\n--- {skill.name} ---")
    print(content[:300])
    # Manually delete: workspace.delete_skill(skill.name)

2. Lock the prompt during skill evolution:

config = ae.EvolveConfig(
    evolve_prompts=False,   # Don't change the prompt
    evolve_skills=True,     # Only evolve skills
)

3. Add constraints to skill descriptions: Skills with clear TRIGGER/DO NOT TRIGGER conditions are less likely to conflict:

---
name: verify-output-format
description: "TRIGGER when: agent has produced final output. DO NOT TRIGGER: during intermediate reasoning steps."
---
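
To audit an existing skill library for descriptions that lack these conditions, a small frontmatter parser is enough. This is a simplified sketch that handles only flat `key: value` frontmatter like the example above (a real YAML parser is more robust); `split_frontmatter` and `has_trigger_conditions` are hypothetical helpers.

```python
def split_frontmatter(text):
    """Split a skill file into (metadata dict, body). Flat key: value pairs only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            meta = {}
            for line in lines[1:i]:
                key, sep, value = line.partition(":")  # Split on the first colon only
                if sep:
                    meta[key.strip()] = value.strip().strip('"')
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
    return {}, text  # No closing delimiter: treat everything as body


def has_trigger_conditions(description):
    """True if the description states both when to fire and when not to."""
    return "TRIGGER when:" in description and "DO NOT TRIGGER:" in description
```

Running both helpers over every skill file gives a quick list of skills whose descriptions are too open-ended and are therefore more likely to clash with the base prompt.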

---

Issue 20: Holdout Set Leaking into Training

Symptoms: Training score and holdout score are suspiciously close, or holdout score drops when training score increases.

Root cause: Benchmark get_tasks() returns overlapping tasks for different splits.

Fix: Ensure strict separation:

import random

class MyBenchmark(ae.BenchmarkAdapter):
    def __init__(self, data_path):
        all_data = load_data(data_path)      # load_data: your own loader
        # Deterministic split: a private RNG keeps the global random state untouched
        random.Random(42).shuffle(all_data)
        split_idx = int(len(all_data) * 0.8)
        self._train = all_data[:split_idx]
        self._test = all_data[split_idx:]

    def get_tasks(self, split="train", limit=10):
        data = self._train if split == "train" else self._test
        if limit:
            data = data[:limit]
        return [ae.Task(id=d["id"], input=d["input"]) for d in data]

Verify no overlap:

train_ids = {t.id for t in benchmark.get_tasks("train", limit=None)}
test_ids = {t.id for t in benchmark.get_tasks("test", limit=None)}
assert len(train_ids & test_ids) == 0, "Train/test overlap detected!"
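
A shuffle-based split reassigns tasks whenever the dataset grows or is reordered. An alternative sketch, offered as one option rather than the library's method, derives the split from a hash of the task ID so that a given task always lands in the same split. `assign_split` is a hypothetical helper.

```python
import hashlib


def assign_split(task_id: str, holdout_ratio: float = 0.2) -> str:
    """Map a task ID to 'train' or 'test' deterministically via SHA-256."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # First 8 bytes as a fraction in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < holdout_ratio else "train"


ids = [f"task-{i}" for i in range(1000)]
train_ids = {i for i in ids if assign_split(i) == "train"}
test_ids = {i for i in ids if assign_split(i) == "test"}
print(len(train_ids), len(test_ids))
```

Because each ID maps to exactly one split, overlap is impossible by construction, and adding new tasks never moves existing ones between splits. The realized holdout fraction is only approximately `holdout_ratio`.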

A-Evolve Official Documentation Reference

This document consolidates key information from the official A-Evolve documentation at github.com/A-EVO-Lab/a-evolve.

Project Overview
Installation Guide
Quick Start Guide
Architecture Overview
Agent Protocol
Benchmark Adapters
Evolution Engines
Workspace Contract
Configuration Reference
Built-in Agents
Built-in Benchmarks
Evolution Algorithms
Skill System
Memory System
Version Control
Observation Pipeline
FAQ

---

Project Overview

A-Evolve is infrastructure for evolving AI agents through self-improvement. It automates data-driven optimization of agents and is designed to work with any domain and any evolution algorithm that implements its interfaces.

Design Principles

1. File-system as contract: All evolvable agent state lives as plain files in a workspace directory. No databases, no learned weights, no opaque parameters. Every mutation is an explicit edit to a text file.

2. Pluggable everything: Three interfaces — BaseAgent, BenchmarkAdapter, EvolutionEngine — enable any combination of agent, benchmark, and algorithm.

3. Git for versioning: Every evolution cycle creates git snapshots. Changes are diffable, rollbackable, and human-readable.

4. LLM-in-the-loop: The default evolution engine uses an LLM with bash tools to analyze observations and directly mutate workspace files. The evolver is itself an AI agent improving other AI agents.

5. Zero manual engineering: Once configured, evolution runs autonomously. The loop handles solving, evaluation, mutation, gating, and convergence detection.

Key Results

Using Claude Opus 4.6 as both the solver and evolver model:

Benchmark	Score	Leaderboard Position
MCP-Atlas	79.4%	#1
SWE-bench Verified	76.8%	~#5
Terminal-Bench 2.0	76.5%	~#7
SkillsBench	34.9%	#2

These results demonstrate that LLM-driven evolution of prompts, skills, and memory can produce state-of-the-art agent performance across diverse domains.

---

Installation Guide

Requirements

Python >= 3.11
Git (for workspace versioning)
An LLM API key (Anthropic, OpenAI, or AWS Bedrock credentials)

Installation Options

# Core package (matplotlib, pyyaml)
pip install a-evolve

# With specific LLM provider support
pip install "a-evolve[anthropic]"     # Anthropic Claude API
pip install "a-evolve[openai]"        # OpenAI API
pip install "a-evolve[bedrock]"       # AWS Bedrock (boto3)
pip install "a-evolve[litellm]"       # Multi-provider via LiteLLM

# With domain-specific dependencies
pip install "a-evolve[swe]"           # SWE-bench (strands-agents, datasets, swebench)
pip install "a-evolve[mcp]"           # MCP-Atlas (mcp, strands-agents, litellm)
pip install "a-evolve[skillbench]"    # SkillsBench (strands-agents)

# Everything
pip install "a-evolve[all]"

# Development
pip install "a-evolve[dev]"           # pytest, ruff, hypothesis

From Source

git clone https://github.com/A-EVO-Lab/a-evolve.git
cd a-evolve
pip install -e ".[all,dev]"

Verifying Installation

import agent_evolve as ae
print(ae.__version__)  # Should print version
print(ae.Evolver)      # Should print class reference

---

Quick Start Guide

3-Line Evolution

import agent_evolve as ae

evolver = ae.Evolver(agent="swe", benchmark="swe-verified")
results = evolver.run(cycles=10)
print(f"Final score: {results.final_score}")

This:

1. Copies the built-in SWE seed workspace to a working directory
2. Instantiates SweAgent from the workspace manifest
3. Runs 10 evolution cycles against SWE-bench Verified
4. Returns EvolutionResult with scores, convergence status, and details

With Custom Configuration

import agent_evolve as ae

config = ae.EvolveConfig(
    batch_size=15,              # 15 tasks per cycle
    max_cycles=25,              # Up to 25 evolution rounds
    evolve_prompts=True,        # Mutate system prompt
    evolve_skills=True,         # Discover and refine skills
    evolve_memory=True,         # Build episodic memory
    holdout_ratio=0.2,          # 20% held out for validation
    evolver_model="us.anthropic.claude-opus-4-6-v1",
    egl_threshold=0.02,         # Stop if < 2% improvement
    egl_window=5,               # Over 5 consecutive cycles
)

evolver = ae.Evolver(
    agent="swe",
    benchmark="swe-verified",
    config=config,
)
results = evolver.run()

# Inspect results
print(f"Cycles: {results.cycles_completed}")
print(f"Score: {results.final_score:.3f}")
print(f"Converged: {results.converged}")
print(f"Score history: {results.score_history}")

---

Architecture Overview

System Diagram

User Code (3 lines)
    │
    ▼
┌──────────────────────────────────────┐
│            Evolver API               │
│  - Resolves agent, benchmark, config │
│  - Creates EvolutionLoop             │
│  - Returns EvolutionResult           │
└──────────────┬───────────────────────┘
               │
    ┌──────────▼──────────┐
    │   EvolutionLoop     │
    │  For each cycle:    │
    │  1. Solve           │
    │  2. Observe         │
    │  3. Snapshot        │
    │  4. Evolve          │
    │  5. Snapshot        │
    │  6. Record          │
    │  7. Reload          │
    │  8. Converge?       │
    └──────────┬──────────┘
               │
    ┌──────────┼──────────┐
    │          │          │
    ▼          ▼          ▼
 Agent    Benchmark    Engine
solve()  evaluate()   step()
    │          │          │
    └──────────┼──────────┘
               │
               ▼
       Agent Workspace
       (filesystem + git)

Component Interactions

Forward flow (solve):

1. EvolutionLoop calls benchmark.get_tasks() to get a batch of tasks
2. For each task, calls agent.solve(task) to get a Trajectory
3. Calls benchmark.evaluate(task, trajectory) to get Feedback
4. Bundles into Observation(task, trajectory, feedback) triples

Evolution flow (mutate):

1. EvolutionLoop passes observations to engine.step()
2. Engine reads workspace files, analyzes observations
3. Engine mutates workspace files (prompts, skills, memory)
4. Returns StepResult(mutated, summary, metadata)

Reload flow (sync):

1. EvolutionLoop calls agent.reload_from_fs()
2. Agent re-reads prompts, skills, memory from workspace
3. Next cycle uses evolved state
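
The three flows can be seen end to end in a toy loop. Everything below is an in-memory stand-in written for illustration: the `Toy*` classes and `run_loop` only mimic the method names from this section and are not the real `EvolutionLoop`, which also handles snapshots, gating, and convergence.

```python
from dataclasses import dataclass


@dataclass
class Task:
    id: str
    input: str


@dataclass
class Feedback:
    success: bool
    score: float
    detail: str


class ToyWorkspace:
    """In-memory stand-in for the on-disk workspace."""
    def __init__(self):
        self.prompt = "Answer the question."


class ToyAgent:
    def __init__(self, workspace):
        self.workspace = workspace
        self.reload_from_fs()

    def reload_from_fs(self):
        self.system_prompt = self.workspace.prompt  # Re-read evolved state

    def solve(self, task):
        # Behaves correctly only once the prompt asks for upper-case output
        return task.input.upper() if "upper-case" in self.system_prompt else task.input


class ToyBenchmark:
    def get_tasks(self):
        return [Task("t1", "abc"), Task("t2", "xyz")]

    def evaluate(self, task, output):
        ok = output == task.input.upper()
        return Feedback(ok, 1.0 if ok else 0.0, "ok" if ok else "expected upper-case output")


class ToyEngine:
    def step(self, workspace, observations):
        if all(fb.success for _, _, fb in observations):
            return False  # Nothing to fix, no mutation
        workspace.prompt += " Always reply in upper-case."
        return True


def run_loop(agent, benchmark, engine, workspace, cycles):
    history = []
    for _ in range(cycles):
        observations = []
        for task in benchmark.get_tasks():                      # Forward flow
            output = agent.solve(task)
            observations.append((task, output, benchmark.evaluate(task, output)))
        history.append(sum(fb.score for _, _, fb in observations) / len(observations))
        if engine.step(workspace, observations):                # Evolution flow
            agent.reload_from_fs()                              # Reload flow
    return history


workspace = ToyWorkspace()
scores = run_loop(ToyAgent(workspace), ToyBenchmark(), ToyEngine(), workspace, cycles=3)
print(scores)
```

The score only improves in the second cycle because the agent re-reads the workspace after the engine mutates it; removing the `reload_from_fs()` call leaves the score flat at zero.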

---

Agent Protocol

BaseAgent Abstract Class

All evolvable agents inherit from BaseAgent:

from agent_evolve.protocol.base_agent import BaseAgent
from agent_evolve.types import Task, Trajectory

class MyAgent(BaseAgent):
    def __init__(self, workspace_dir: str):
        super().__init__(workspace_dir)
        # Initialize your LLM client, tools, etc.

    def solve(self, task: Task) -> Trajectory:
        """Solve a single task and return the trajectory.

        This is the only method you MUST override.
        """
        # Your solving logic here
        return Trajectory(
            task_id=task.id,
            output="solution",
            steps=[{"tool": "llm", "action": "generate"}],
        )

Agent Lifecycle

1. Construction: __init__(workspace_dir) sets up the LLM client and loads initial state
2. State loading: reload_from_fs() reads prompts, skills, and memory from the workspace
3. Solving: solve(task) processes one task and returns a trajectory
4. Memory buffering: remember(content, category) stores lessons during solve
5. State export: export_to_fs() flushes buffered memories and skill proposals
6. Hot reload: reload_from_fs() re-reads files after evolution mutates them

Agent Properties

Property	Type	Description
`system_prompt`	`str`	Content of `prompts/system.md`
`skills`	`list[SkillMeta]`	Available skills from `skills/` directory
`memories`	`list[dict]`	Loaded episodic/semantic memories

Agent Best Practices

1. Always use `self.system_prompt`; don't hardcode prompts
2. Inject skills into LLM context; they're the primary evolution mechanism
3. Call `remember()` for reusable lessons, not for task-specific notes
4. Keep `solve()` deterministic when possible (temperature=0 for reproducibility)
5. Truncate trajectories; don't store the full conversation if it is not needed for evolution

---

Benchmark Adapters

BenchmarkAdapter Abstract Class

from agent_evolve.benchmarks.base import BenchmarkAdapter
from agent_evolve.types import Task, Trajectory, Feedback

class MyBenchmark(BenchmarkAdapter):
    def get_tasks(self, split="train", limit=10):
        """Return tasks from the benchmark dataset.

        Args:
            split: "train" or "test" (for holdout evaluation)
            limit: Maximum number of tasks to return (default 10)
        """
        return [Task(id="1", input="task description")]

    def evaluate(self, task, trajectory):
        """Evaluate an agent's trajectory on a task.

        Returns Feedback with:
        - success: bool (binary pass/fail)
        - score: float (0.0 to 1.0 continuous)
        - detail: str (human-readable explanation)
        """
        return Feedback(success=True, score=0.9, detail="Passed 9/10 tests")

Benchmark Best Practices

1. Rich feedback details: the evolver reads feedback.detail to decide what to mutate
2. Deterministic evaluation: the same input should produce the same score
3. Diverse task coverage: include easy, medium, and hard tasks
4. Strict train/test split: no overlap between splits
5. Score granularity: continuous scores (0.0-1.0) are more useful than binary pass/fail

---

Evolution Engines

EvolutionEngine Abstract Class

from agent_evolve.engine.base import EvolutionEngine
from agent_evolve.types import StepResult

class MyEngine(EvolutionEngine):
    def step(self, workspace, observations, history, trial):
        """Mutate the workspace based on observations.

        Args:
            workspace: AgentWorkspace — typed I/O for agent files
            observations: list[Observation] — recent (task, trajectory, feedback) triples
            history: EvolutionHistory — query past cycles and workspace versions
            trial: TrialRunner — optional live evaluation runner

        Returns:
            StepResult with mutated flag, summary, and metadata
        """
        # Analyze observations, mutate workspace
        return StepResult(mutated=True, summary="Updated prompts")

    def on_cycle_end(self, accepted: bool, score: float):
        """Optional callback after gating decision."""
        pass

Engine Selection Guide

Engine	When to Use	Compute Cost
AEvolveEngine (default)	General-purpose, diverse domains	High (full LLM call)
GuidedSynthesisEngine	Skill discovery focus	Medium
AdaptiveEvolutionEngine	Noisy evaluation, fine control	Medium
AdaptiveSkillEngine	Skill-heavy domains	Medium
Custom	Domain-specific mutation logic	Variable

---

Workspace Contract

Directory Structure

workspace/
├── manifest.yaml              # Required: agent metadata
├── prompts/
│   ├── system.md              # Main system prompt
│   └── fragments/             # Modular prompt pieces
│       ├── reasoning.md
│       └── output_format.md
├── skills/
│   ├── _drafts/               # Proposed skills pending review
│   │   └── new-skill.md
│   └── verify-solution/       # Accepted skills
│       └── SKILL.md
├── tools/
│   ├── registry.yaml          # Tool manifest
│   └── custom_tool.py         # Tool implementations
├── memory/
│   ├── episodic.jsonl         # Failure lessons
│   └── semantic.jsonl         # Domain knowledge
└── evolution/                 # Managed by loop
    ├── observations/
    │   ├── batch_0001.jsonl
    │   └── batch_0002.jsonl
    ├── history.jsonl
    └── metrics.json

Manifest Format

agent:
  type: reference                                    # Must be "reference"
  entrypoint: my_package.agents.MyAgent              # Dotted Python path

evolvable_layers:                                    # Which directories can be mutated
  - prompts                                          # System prompt + fragments
  - skills                                           # Skill library
  - memory                                           # Episodic/semantic memory
  # - tools                                          # Tool implementations (optional)

reload_strategy: hot                                 # "hot" (re-read files) or "cold" (restart)
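Before starting a run it can help to sanity-check a manifest yourself. The following is a minimal sketch that validates an already-parsed manifest (for example, the dict returned by `yaml.safe_load`). The function name, the set of layer names, and the decision to require `reload_strategy` are assumptions made for illustration; the checks A-Evolve performs internally may differ.

```python
# Layer names taken from the manifest example above; treating this set as
# exhaustive is an assumption.
KNOWN_LAYERS = {"prompts", "skills", "memory", "tools"}


def validate_manifest(manifest: dict) -> list[str]:
    """Hypothetical check: return a list of problems (empty means it looks valid)."""
    problems = []
    agent = manifest.get("agent") or {}
    if agent.get("type") != "reference":
        problems.append('agent.type must be "reference"')
    entrypoint = agent.get("entrypoint")
    if not isinstance(entrypoint, str) or "." not in entrypoint:
        problems.append("agent.entrypoint must be a dotted Python path")
    layers = manifest.get("evolvable_layers")
    if not isinstance(layers, list) or not layers:
        problems.append("evolvable_layers must be a non-empty list")
    else:
        unknown = sorted(
            str(layer)
            for layer in layers
            if not isinstance(layer, str) or layer not in KNOWN_LAYERS
        )
        if unknown:
            problems.append("unknown evolvable_layers: " + ", ".join(unknown))
    if manifest.get("reload_strategy") not in ("hot", "cold"):
        problems.append('reload_strategy must be "hot" or "cold"')
    return problems


manifest = {
    "agent": {"type": "reference", "entrypoint": "my_package.agents.MyAgent"},
    "evolvable_layers": ["prompts", "skills", "memory"],
    "reload_strategy": "hot",
}
print(validate_manifest(manifest))
```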

AgentWorkspace API

The AgentWorkspace class provides typed read/write access:

Prompts:

read_prompt() -> str — reads prompts/system.md
write_prompt(content: str) — writes prompts/system.md
read_fragment(name: str) -> str — reads prompts/fragments/{name}
write_fragment(name: str, content: str) — writes a fragment
list_fragments() -> list[str] — lists fragment filenames

Skills:

list_skills() -> list[SkillMeta] — lists skills with name, description, path
read_skill(name: str) -> str — reads skill content (frontmatter stripped)
write_skill(name: str, content: str) — writes or updates a skill
delete_skill(name: str) — removes a skill directory

Drafts:

list_drafts() -> list[dict] — lists pending skill proposals
write_draft(name: str, content: str) — writes a draft proposal
clear_drafts() — removes all pending drafts

Memory:

add_memory(entry: dict, category: str = "episodic") — appends to category JSONL
read_memories(category: str = "episodic", limit: int = 100) -> list[dict]
read_all_memories(limit: int = 100) -> list[dict] — all categories combined

Tools:

read_tool_registry() -> list[dict] — reads tools/registry.yaml
write_tool_registry(tools: list[dict]) — writes tool manifest
read_tool(name: str) -> str — reads tool source code
write_tool(name: str, content: str) — writes tool implementation

Evolution Metadata:

read_evolution_history() -> list[dict] — reads evolution/history.jsonl
read_evolution_metrics() -> dict — reads evolution/metrics.json
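To make the memory methods concrete, here is a small stand-in that mimics `add_memory` and `read_memories` with one JSONL file per category. `MemoryStore` is a hypothetical class written for this sketch, not the real `AgentWorkspace`; in particular, returning the most recent `limit` entries is an assumption about how the limit is applied.

```python
import json
import tempfile
from pathlib import Path


class MemoryStore:
    """Hypothetical stand-in for the memory methods of AgentWorkspace."""

    def __init__(self, root: Path):
        self.memory_dir = Path(root) / "memory"
        self.memory_dir.mkdir(parents=True, exist_ok=True)

    def add_memory(self, entry: dict, category: str = "episodic") -> None:
        # Append one JSON object per line to memory/{category}.jsonl.
        path = self.memory_dir / f"{category}.jsonl"
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def read_memories(self, category: str = "episodic", limit: int = 100) -> list[dict]:
        path = self.memory_dir / f"{category}.jsonl"
        if limit <= 0 or not path.exists():
            return []
        lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
        # Assumption: keep the newest entries when the file exceeds the limit.
        return [json.loads(ln) for ln in lines[-limit:]]


with tempfile.TemporaryDirectory() as tmp:
    store = MemoryStore(Path(tmp))
    store.add_memory({"content": "pytest --no-header flag needed for clean output"})
    store.add_memory(
        {"content": "Django uses reverse URL resolution via urlpatterns"},
        category="semantic",
    )
    print(store.read_memories())
    print(store.read_memories("semantic"))
```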

---

Configuration Reference

EvolveConfig Fields

Field	Type	Default	Description
`batch_size`	`int`	`10`	Tasks per solve round
`max_cycles`	`int`	`20`	Maximum evolution iterations
`holdout_ratio`	`float`	`0.2`	Fraction held out for validation
`evolve_prompts`	`bool`	`True`	Allow prompt mutation
`evolve_skills`	`bool`	`True`	Allow skill creation/modification
`evolve_memory`	`bool`	`True`	Allow memory writes
`evolve_tools`	`bool`	`False`	Allow tool implementation changes
`trajectory_only`	`bool`	`False`	Hide scores from evolver
`evolver_model`	`str`	`"us.anthropic.claude-opus-4-6-v1"`	LLM for evolution engine
`evolver_max_tokens`	`int`	`16384`	Max tokens for evolver calls
`egl_threshold`	`float`	`0.05`	Convergence epsilon
`egl_window`	`int`	`3`	Cycles for plateau detection
`extra`	`dict`	`{}`	Extension point for custom params

Loading from YAML

# config.yaml
batch_size: 15
max_cycles: 30
evolve_prompts: true
evolve_skills: true
evolve_memory: false
evolver_model: us.anthropic.claude-opus-4-6-v1
egl_threshold: 0.03
egl_window: 5
extra:
  solver_proposed: true
  merge_threshold: 0.7

import agent_evolve as ae

config = ae.EvolveConfig.from_yaml("config.yaml")

Configuration Strategies

Conservative (stable improvement):

config = ae.EvolveConfig(
    batch_size=10,
    max_cycles=10,
    evolve_prompts=True,
    evolve_skills=False,
    evolve_memory=False,
    egl_threshold=0.05,
)

Aggressive (maximum exploration):

config = ae.EvolveConfig(
    batch_size=20,
    max_cycles=50,
    evolve_prompts=True,
    evolve_skills=True,
    evolve_memory=True,
    evolve_tools=True,
    egl_threshold=0.01,
    egl_window=7,
)

Skill-focused (procedure discovery):

config = ae.EvolveConfig(
    batch_size=10,
    max_cycles=25,
    evolve_prompts=False,
    evolve_skills=True,
    evolve_memory=True,
)

---

Built-in Agents

SWE Agent (`seed_workspaces/swe/`)

Domain: SWE-bench code patching
Model: Claude Opus 4.6 via AWS Bedrock
Framework: Strands-agents (CodeDojo-compatible)

Key features:

Verify-fix loop: runs tests before and after each edit
Hypothesis-first approach: form theory before exploring
Skill proposal generation: agent reflects on verification process
Conversation capture with per-turn token tracking
Dynamic tool loading from workspace tools/registry.yaml

Tools available: bash, submit, text_editor, python_exec

Terminal Agent (`seed_workspaces/terminal/`)

Domain: Terminal-Bench 2.0 shell challenges
Model: Claude Sonnet 4 via AWS Bedrock
Framework: Strands-agents

Key features:

Concurrent timeout enforcement via ThreadPoolExecutor
Test file copying only during evaluation (prevents cheating)
Pre-built skills: self-verification, environment-discovery, scientific-computing, debug-and-fix
Memory injection disabled (time-sensitive tasks)
Graceful timeout fallback

Tools available: bash, python, submit

MCP Agent (`seed_workspaces/mcp/`)

Domain: MCP-Atlas tool calling
Model: Claude Opus 4.6 via AWS Bedrock
Framework: Strands-agents with MCP integration

Key features:

MCP server connection management
Tool discovery and invocation
Multi-provider support via LiteLLM

---

Built-in Benchmarks

SWE-bench Verified

Module: agent_evolve.benchmarks.swe_verified
Tasks: Real GitHub issues from popular Python repositories
Evaluation: Runs test suite, checks if agent's patch fixes the issue
Metric: Pass rate (0.0 to 1.0)

MCP-Atlas

Module: agent_evolve.benchmarks.mcp_atlas
Tasks: Tool calling scenarios with MCP servers
Evaluation: Checks correct tool selection and parameter passing
Metric: Accuracy (0.0 to 1.0)

Terminal-Bench 2.0

Module: agent_evolve.benchmarks.terminal2
Tasks: Shell command challenges (file manipulation, system admin, scripting)
Evaluation: Runs test scripts to verify terminal state
Metric: Pass rate (0.0 to 1.0)

SkillsBench

Module: agent_evolve.benchmarks.skill_bench
Tasks: Multi-step procedural tasks
Evaluation: Checks step-by-step correctness
Metric: Accuracy (0.0 to 1.0)

ARC-AGI-3

Module: agent_evolve.benchmarks.arc_agi3
Tasks: Interactive game levels (25 games, 181 levels)
Evaluation: RHAE score (ratio of human to agent actions, squared)
Metric: Average RHAE across levels (0.0 to 1.0)

---

Evolution Algorithms

AEvolveEngine (SkillForge)

Module: agent_evolve.algorithms.skillforge.engine
Strategy: LLM-driven workspace mutation

The default engine gives an LLM full bash tool access to the workspace and asks it to improve the agent based on observations. This is the most flexible engine — it can make arbitrary changes to any workspace file.

Context provided to the LLM:

Recent observations (task inputs, agent outputs, feedback)
Current system prompt
Current skill library
Pending draft proposals
Score history

Mutation capabilities:

Edit system prompt (refine, consolidate, extend)
Create new skills from observed patterns
Merge overlapping skills
Write episodic memory entries
Review and curate draft proposals

GuidedSynthesisEngine

Module: agent_evolve.algorithms.guided_synth
Strategy: Memory-first, curated skills

Emphasizes learning from failures before creating skills. Conservative approach that prevents skill bloat.

Process:

1. Extract lessons from failed tasks
2. Write episodic memory entries
3. After accumulating patterns, synthesize skill proposals
4. Curate proposals: ACCEPT, MERGE, or SKIP

AdaptiveEvolutionEngine

Module: agent_evolve.algorithms.adaptive
Strategy: Reward tracking + observation filtering

Adjusts intervention intensity based on score trends. Makes smaller changes when improving, larger changes when plateaued.
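The idea of scaling intervention with the score trend can be sketched in a few lines. This is only an illustration of the stated behavior: the function name, the `"small"`/`"large"` labels, the window, and the `min_gain` cutoff are all hypothetical and are not taken from the AdaptiveEvolutionEngine implementation.

```python
def mutation_intensity(scores: list[float], window: int = 3, min_gain: float = 0.02) -> str:
    """Hypothetical rule: small edits while improving, large edits on a plateau."""
    if len(scores) <= window:
        return "small"  # assumption: start cautiously with little history
    # Net gain over the last `window` cycles.
    gain = scores[-1] - scores[-1 - window]
    return "small" if gain >= min_gain else "large"


print(mutation_intensity([0.50, 0.55, 0.60, 0.66]))  # still improving
print(mutation_intensity([0.60, 0.61, 0.60, 0.61]))  # plateaued
```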

AdaptiveSkillEngine

Module: agent_evolve.algorithms.adaptive_skill
Strategy: Skill-centric discovery

Focuses exclusively on building the skill library. Identifies task categories where the agent fails and creates targeted skills.

---

Skill System

Skill File Format

---
name: verify-edge-cases
description: "TRIGGER when: checking boundary conditions. DO NOT TRIGGER: for happy-path tests."
---

## Pattern
Test all falsy-but-valid values: 0, False, "", [], {}

## Process
1. List all input boundaries
2. Run each against the implementation
3. Check both output AND side effects
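To show how a file in this format separates into metadata and body, here is a minimal parser sketch. It handles only flat `key: value` frontmatter lines like the ones above and is not a full YAML parser; `split_skill` and `has_trigger` are hypothetical helpers, not functions from `agent_evolve`.

```python
SKILL_TEXT = '''---
name: verify-edge-cases
description: "TRIGGER when: checking boundary conditions. DO NOT TRIGGER: for happy-path tests."
---

## Pattern
Test all falsy-but-valid values: 0, False, "", [], {}
'''


def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Return (frontmatter fields, body) for a skill file; hypothetical helper."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing frontmatter delimiter.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return meta, body


def has_trigger(meta: dict[str, str]) -> bool:
    """True when the description states a TRIGGER condition."""
    return "TRIGGER" in meta.get("description", "")


meta, body = split_skill(SKILL_TEXT)
print(meta["name"], has_trigger(meta))
print(body.splitlines()[0])
```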

Skill Discovery Process

1. Agent proposes: During solve(), agent writes draft to skills/_drafts/
2. Engine reviews: During step(), engine reads drafts and decides:

ACCEPT: Move to skills/{name}/SKILL.md
MERGE: Combine with existing similar skill
SKIP: Discard (too narrow, redundant, or incorrect)

3. Engine creates: Engine can also create skills directly from observation analysis
4. Refinement: Existing skills are updated based on new observations

Skill Library Management

Target: 5-10 broad, reusable skills per workspace. Avoid:

30+ narrow skills (library bloat)
Skills that duplicate system prompt content
Skills with no TRIGGER condition (always-on = should be in prompt)

---

Memory System

Episodic Memory

Records specific lessons from task attempts:

{"content": "pytest --no-header flag needed for clean output", "category": "episodic", "task_id": "django-16379"}
{"content": "Off-by-one errors common in range() with len()", "category": "episodic", "task_id": "numpy-8823"}

Semantic Memory

General domain knowledge:

{"content": "Django uses reverse URL resolution via urlpatterns", "category": "semantic"}
{"content": "NumPy broadcasting rules: dimensions must match or be 1", "category": "semantic"}

Memory Limits

BaseAgent.reload_from_fs() loads up to 200 memory entries by default
AgentWorkspace.read_memories() defaults to limit=100
Old memories should be pruned or consolidated during evolution
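One simple way to prune is to drop duplicate lessons and keep only the newest entries. The sketch below assumes entries are dicts with a string `content` field stored oldest-first, as in an append-only JSONL file; `prune_memories` is a hypothetical helper and the cap of 200 simply mirrors the default load limit mentioned above.

```python
def prune_memories(entries: list[dict], max_entries: int = 200) -> list[dict]:
    """Hypothetical pruning pass: dedupe by content, keep the newest entries."""
    if max_entries <= 0:
        return []
    seen = set()
    kept = []
    # Walk newest-first so the most recent copy of a duplicate wins.
    for entry in reversed(entries):
        content = entry.get("content")
        if content in seen:
            continue
        seen.add(content)
        kept.append(entry)
        if len(kept) == max_entries:
            break
    kept.reverse()  # restore oldest-first order
    return kept


memories = [
    {"content": "pytest --no-header flag needed for clean output", "task_id": "django-16379"},
    {"content": "Off-by-one errors common in range() with len()", "task_id": "numpy-8823"},
    {"content": "pytest --no-header flag needed for clean output", "task_id": "django-16400"},
]
print(prune_memories(memories, max_entries=2))
```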

---

Version Control

Git Tagging Convention

Tag	When Created	Purpose
`pre-evo-1`	Before cycle 1 evolution	Snapshot of solve-only state
`evo-1`	After cycle 1 evolution	Snapshot of evolved state
`pre-evo-2`	Before cycle 2 evolution	Snapshot before next mutation
`evo-2`	After cycle 2 evolution	Snapshot of evolved state

Useful Git Commands

# See all evolution checkpoints
git tag -l "evo-*"

# Compare two evolution stages
git diff evo-1 evo-10

# See what changed in a specific cycle
git diff pre-evo-5 evo-5

# Read a file at a specific point in time
git show evo-3:prompts/system.md

# Restore tracked files to a known good state (files added after evo-5 are not removed)
git checkout evo-5 -- .

---

Observation Pipeline

JSONL Format

Each observation is stored in evolution/observations/batch_{label}.jsonl:

{
  "task_id": "django__django-16379",
  "task_input": "Fix FileBasedCache has_key method...",
  "task_metadata": {},
  "agent_output": "--- a/django/core/cache/backends/filebased.py\n+++ ...",
  "steps": [
    {"tool": "bash", "action": "read_file", "file": "django/core/cache/backends/filebased.py"},
    {"tool": "text_editor", "action": "edit", "file": "django/core/cache/backends/filebased.py"}
  ],
  "success": true,
  "score": 1.0,
  "feedback_detail": "All 24 tests passed"
}
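Batch files can also be inspected without the library. The following sketch reads one batch file and summarizes it, assuming each line holds a single JSON object with the keys shown above (the example is pretty-printed for readability). `summarize_batch` and the second task ID are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def summarize_batch(path: Path) -> dict:
    """Hypothetical helper: count, mean score, and failed task IDs for one batch."""
    observations = [
        json.loads(line)
        for line in Path(path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    failed = [obs["task_id"] for obs in observations if not obs.get("success")]
    scores = [float(obs.get("score", 0.0)) for obs in observations]
    return {
        "count": len(observations),
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
        "failed_task_ids": failed,
    }


# Demo: write two observations to a temporary batch file, then summarize it.
rows = [
    {"task_id": "django__django-16379", "success": True, "score": 1.0,
     "feedback_detail": "All 24 tests passed"},
    {"task_id": "example-task-2", "success": False, "score": 0.5,
     "feedback_detail": "Passed 12/24 tests"},
]
with tempfile.TemporaryDirectory() as tmp:
    batch = Path(tmp) / "batch_0001.jsonl"
    batch.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    summary = summarize_batch(batch)
print(summary)
```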

Querying Observations

from agent_evolve.engine.observer import EvolutionHistory

history = EvolutionHistory("./my-workspace")

# All observations from last 3 cycles
recent = history.get_observations(last_n_cycles=3)

# Only failures
failures = history.get_observations(only_failures=True)

# Score curve
scores = history.get_score_curve()  # [(1, 0.62), (2, 0.68), ...]

---

FAQ

Can I use A-Evolve with any LLM?

Yes. The agent can use any LLM for solving. The evolver model is configurable via EvolveConfig.evolver_model. Supported providers: Anthropic (direct API), OpenAI, AWS Bedrock, LiteLLM (multi-provider).

Does evolution require training data?

Not in the traditional ML sense. You need a BenchmarkAdapter that provides tasks and evaluation, but there are no training or gradient steps. Evolution is purely file-system mutation guided by LLM reasoning.

How many cycles should I run?

Start with 10 cycles and check convergence. If the score is still improving, run more. With the defaults (egl_threshold=0.05, egl_window=3), convergence detection stops the run automatically once improvement plateaus.
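For intuition about how a threshold and window can define a plateau, here is a minimal sketch. The exact rule A-Evolve applies is not spelled out in this section, so treat the comparison below (best score in the last `window` cycles versus the best score before them) as an assumption; `has_plateaued` is a hypothetical helper.

```python
def has_plateaued(scores: list[float], threshold: float = 0.05, window: int = 3) -> bool:
    """Assumed rule: the last `window` cycles improved the best score by < threshold."""
    if len(scores) <= window:
        return False  # not enough history to judge
    earlier_best = max(scores[:-window])
    recent_best = max(scores[-window:])
    return recent_best - earlier_best < threshold


print(has_plateaued([0.50, 0.62, 0.63, 0.64, 0.64]))  # gains under 0.05: plateau
print(has_plateaued([0.40, 0.50, 0.62, 0.70]))        # still climbing
```

A larger window or a smaller threshold makes the check more patient, which matches the direction of the aggressive configuration shown earlier (egl_threshold=0.01, egl_window=7).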

Can I resume evolution after stopping?

Yes. The workspace retains its evolved state. Create a new Evolver pointing to the same workspace and call run() again.

Is evolution deterministic?

No. LLM calls are inherently non-deterministic. Running the same config twice may produce different evolved agents with similar final scores.

Can I evolve multiple agents simultaneously?

Yes, but each must have its own workspace directory. The evolution loop modifies workspace files directly, so concurrent access to the same workspace is not safe.

What's the cost per evolution cycle?

Each cycle involves: (batch_size) agent solve calls + 1 evolver call. For batch_size=10 with Claude, expect ~$5-20 per cycle depending on task complexity and model used.
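The call count above translates into a back-of-the-envelope budget. In this sketch the per-call prices are hypothetical placeholders chosen only to land inside the quoted range; substitute figures measured from your own runs.

```python
def estimate_cycle_cost(batch_size: int, solve_cost_usd: float, evolver_cost_usd: float) -> float:
    """One cycle = batch_size solve calls plus one evolver call."""
    return batch_size * solve_cost_usd + evolver_cost_usd


def estimate_run_cost(cycles: int, batch_size: int, solve_cost_usd: float,
                      evolver_cost_usd: float) -> float:
    """Total for a run, assuming every cycle costs the same."""
    return cycles * estimate_cycle_cost(batch_size, solve_cost_usd, evolver_cost_usd)


# Hypothetical prices: $0.80 per solve call, $2.00 per evolver call.
per_cycle = estimate_cycle_cost(batch_size=10, solve_cost_usd=0.80, evolver_cost_usd=2.00)
per_run = estimate_run_cost(cycles=10, batch_size=10, solve_cost_usd=0.80, evolver_cost_usd=2.00)
print(per_cycle, per_run)
```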

Can I use A-Evolve without a benchmark?

Not directly. The evolution loop requires BenchmarkAdapter.evaluate() to produce Feedback. However, you can implement a custom benchmark that uses human evaluation, LLM-as-judge, or any other scoring mechanism.

A-Evolve Release History

v0.1.0 — Initial Public Release

Date: 2025

Highlights:

Universal agent evolution infrastructure
Three pluggable interfaces: BaseAgent, BenchmarkAdapter, EvolutionEngine
File-system workspace contract with git versioning
Four built-in evolution algorithms

Benchmark Results (Claude Opus 4.6):

MCP-Atlas: 79.4% (#1 on leaderboard)
SWE-bench Verified: 76.8% (~#5 on leaderboard)
Terminal-Bench 2.0: 76.5% (~#7 on leaderboard)
SkillsBench: 34.9% (#2 on leaderboard)

Core Components

Agent Protocol (agent_evolve.protocol.base_agent):

BaseAgent abstract class with solve(), reload_from_fs(), export_to_fs()
Memory buffering via remember()
Skill access via get_skill_content()
Properties: system_prompt, skills, memories

Benchmark Adapter (agent_evolve.benchmarks.base):

BenchmarkAdapter abstract class with get_tasks() and evaluate()
Built-in adapters: SWE-bench Verified, MCP-Atlas, Terminal-Bench 2.0, SkillsBench, ARC-AGI-3

Evolution Engine (agent_evolve.engine.base):

EvolutionEngine abstract class with step() and on_cycle_end()
Default engine: AEvolveEngine (LLM-driven workspace mutation via bash tools)
Additional engines: GuidedSynthesisEngine, AdaptiveEvolutionEngine, AdaptiveSkillEngine

Evolution Loop (agent_evolve.engine.loop):

Orchestrates solve → observe → evolve → gate → reload cycles
Git snapshot versioning (pre-evo-N, evo-N tags)
Convergence detection with configurable threshold and window
JSONL observation storage

Agent Workspace (agent_evolve.contract.workspace):

AgentWorkspace class for typed file I/O
Prompt read/write (system.md + fragments)
Skill CRUD (list, read, write, delete)
Draft management (propose, list, clear)
Memory management (add, read by category)
Tool registry and implementation management
Evolution metadata access

Configuration (agent_evolve.config):

EvolveConfig dataclass with YAML loading
Controls: batch_size, max_cycles, holdout_ratio
Layer toggles: evolve_prompts, evolve_skills, evolve_memory, evolve_tools
Evolver model configuration (supports Anthropic, OpenAI, Bedrock, LiteLLM)
Convergence: egl_threshold (default 0.05), egl_window (default 3)

Top-Level API (agent_evolve.api):

Evolver class: 3-line setup and run
Auto-resolution of agent seeds and benchmark names
Workspace copying and manifest validation

Built-in Seed Agents

Agent	Domain	Framework	Model
SWE Agent	SWE-bench	Strands	Claude Opus 4.6 (Bedrock)
Terminal Agent	Terminal-Bench	Strands	Claude Sonnet 4 (Bedrock)
MCP Agent	MCP-Atlas	Strands	Claude Opus 4.6 (Bedrock)

Evolution Algorithms

Algorithm	Module	Strategy
A-Evolve/SkillForge	`algorithms.skillforge`	LLM with bash tools mutates workspace
Guided Synthesis	`algorithms.guided_synth`	Memory-first, curated skill proposals
Adaptive Evolution	`algorithms.adaptive`	Reward tracking, observation filtering
Adaptive Skill	`algorithms.adaptive_skill`	Skill-centric discovery and refinement

Installation Options

pip install a-evolve                # Core (matplotlib, pyyaml)
pip install a-evolve[anthropic]     # + anthropic>=0.30
pip install a-evolve[openai]        # + openai>=1.30
pip install a-evolve[bedrock]       # + boto3>=1.34
pip install a-evolve[litellm]       # + litellm>=1.0.0
pip install a-evolve[swe]           # + strands-agents, datasets, swebench
pip install a-evolve[mcp]           # + mcp, strands-agents, litellm
pip install a-evolve[all]           # Everything
pip install a-evolve[dev]           # + pytest, ruff, hypothesis

Requirements

Python >= 3.11
Core dependencies: matplotlib >= 3.10.0, pyyaml >= 6.0
Git (for workspace versioning)

Known Limitations

Evolution loop is single-threaded (sequential cycles)
Convergence check uses a hardcoded epsilon=0.01 in the loop internals, whereas EvolveConfig exposes a configurable egl_threshold
No built-in distributed evaluation (parallelize via external orchestration)
Workspace versioning requires git; non-git workflows not supported


How it compares

API reference for an evolution engine—not a single-turn codegen skill or an MCP data connector.

FAQ

Who is evolving-ai-agents for?

Developers building or tuning autonomous agents who need the `agent_evolve` Evolver API, built-in seeds, and benchmark adapters spelled out in one place.

When should I use evolving-ai-agents?

In Build when wiring evolution into your agent repo; in Ship when regression-testing evolved layers against swe-verified or mcp-atlas; in Operate when iterating cycles after production failures.

Is evolving-ai-agents safe to install?

Evolution runs local workspaces and benchmarks that may execute agent code; review the Security Audits panel on this page and sandbox runs before trusting evolved entrypoints.


