Ce Optimize

Name: Ce Optimize
Author: everyinc

everyinc/compound-engineering-plugin

Run bounded, metric-driven optimization loops (build latency, relevance, etc.) with hard gates so agents improve code without breaking CI.

Overview

ce-optimize is an agent skill that drives bounded optimization loops against objective metrics, with CI-style gates and scoped file edits. It is most often used in the Ship phase and also fits Build (backend) and Operate (iteration).

Install

npx skills add https://github.com/everyinc/compound-engineering-plugin --skill ce-optimize

What is this skill?

YAML recipe pattern: primary hard metric, degenerate gates (build_passed, test_pass_rate), and diagnostic side metrics
Measurement harness with repeat runs, median aggregation, and noise threshold (e.g. 3 repeats, 0.05 noise)
Explicit mutable vs immutable scope so eval scripts and fixtures stay trusted
Stopping rules: max iterations, plateau detection, hours cap, and target_reached
Serial worktree execution mode with max_concurrent for controlled agent edits
Default max_iterations: 4 in latency template
Measurement stability: repeat_count 3 with median aggregation
Default noise_threshold: 0.05
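
The stability settings listed above (3 repeats, median aggregation, 0.05 noise threshold) can be read as the sketch below. It assumes the noise threshold is a fraction of the baseline value and that a candidate only counts as better when it beats the baseline by more than that margin. The skill's actual comparison rule may differ, and the function names are hypothetical.

```python
from statistics import median


def aggregate(samples):
    """Collapse repeated measurements into one value using the median."""
    return median(samples)


def is_improvement(baseline, candidate, noise_threshold=0.05, direction="minimize"):
    """Return True when the candidate beats the baseline by more than the noise margin.

    Assumption: noise_threshold is relative to the baseline value (0.05 = 5%).
    """
    margin = abs(baseline) * noise_threshold
    if direction == "minimize":
        return candidate < baseline - margin
    return candidate > baseline + margin


# Three repeats per measurement, as in the latency template.
baseline = aggregate([42.0, 40.0, 41.0])
candidate = aggregate([38.0, 39.5, 37.0])
print(is_improvement(baseline, candidate))
```

Using the median rather than the mean keeps a single slow outlier run from deciding whether a change is kept.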

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 1.6k installs on skills.sh; 20.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You want the agent to improve a build or quality metric, but every ad-hoc pass risks breaking tests or editing eval infrastructure.

Who is it for?

Indie devs with an evaluate.py (or equivalent) harness optimizing build_seconds, artifact size, or similar hard metrics.

Skip if: you have a greenfield app with no tests, no measurement command, or goals that need purely subjective judgment without an eval loop.

When should I use this skill?

Reduce build latency without regressing correctness, or improve another scalar metric defined with hard gates and a measurement command.

What do I get? / Deliverables

You get a repeatable optimization batch in which only scoped paths change, hard gates stay green, and iteration stops on plateau or target. Runner-up patches are merged only when your template allows it (both bundled templates set max_runner_up_merges_per_batch: 0).

Optimized scoped source/config changes
Measurement run logs under stability settings
Stopped batch respecting plateau or target rules

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Ship/perf is the canonical shelf because the bundled templates optimize measurable delivery outcomes (build seconds, pass rates) right before or after release pressure. Performance and build-time tuning match the perf subphase; the skill is about iterative improvement under measurement, not greenfield feature coding.

Also useful

Build: Backend, data & payments
Operate: Iteration & experiments

Where it fits

Example use (Build: Backend, data & payments)

Tune build.yaml and src/build/ while immutable eval fixtures enforce correctness.

Example use (Ship: Performance)

Minimize build_seconds with repeat_count 3 median aggregation before a release branch.

Example use (Operate: Iteration & experiments)

Plateau-aware stopping after max_iterations when nightly builds creep upward.

How it compares

A metric-gated agent optimization recipe, not a one-off “refactor for speed” chat or a hosted A/B platform.

Common Questions / FAQ

Who is ce-optimize for?

Solo builders using compound-engineering-style agents who already have tests and a measurement script and want structured improve-build-latency-style runs.

When should I use ce-optimize?

Use it in Ship perf when build time blocks releases, in Build backend when tuning build pipelines under test gates, or in Operate iterate when tightening tooling loops. It applies as soon as you have a primary metric and degenerate gates defined in YAML.

Is ce-optimize safe to install?

It runs shell commands and edits the repository within the mutable scope. Review the Security Audits on this page and lock immutable paths (eval scripts, fixtures, CI scripts) before trusting automated merges.
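
One way to check the mutable/immutable boundary before merging is sketched below. It uses the scope lists from the build-latency template and assumes prefix matching on repository-relative POSIX paths. The skill's own matching rules are not shown on this page, so the helper names and the "out_of_scope" category are hypothetical.

```python
from pathlib import PurePosixPath

# Scope lists copied from the improve-build-latency template.
MUTABLE = ["src/build/", "config/build.yaml"]
IMMUTABLE = ["tools/eval/evaluate.py", "tests/fixtures/", "scripts/ci/"]


def _matches(path, entry):
    """True when path equals the scope entry or lies underneath it."""
    p = PurePosixPath(path)
    e = PurePosixPath(entry)
    return p == e or e in p.parents


def classify(path):
    """Label a changed path; immutable entries take precedence (an assumption)."""
    if any(_matches(path, e) for e in IMMUTABLE):
        return "immutable"
    if any(_matches(path, e) for e in MUTABLE):
        return "mutable"
    return "out_of_scope"


def violations(changed_paths):
    """Return every changed path that is not explicitly mutable."""
    return [p for p in changed_paths if classify(p) != "mutable"]


changed = ["src/build/cache.py", "tests/fixtures/sample.json", "README.md"]
print(violations(changed))
```

Treating anything outside the mutable list as a violation is the conservative choice: a path has to be opted in before an agent edit to it is accepted.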

SKILL.md

# Minimal first-run template for objective metrics.
# Start here when "better" is a scalar value from the measurement harness.

name: improve-build-latency
description: Reduce build latency without regressing correctness

metric:
  primary:
    type: hard
    name: build_seconds
    direction: minimize
  degenerate_gates:
    - name: build_passed
      check: "== 1"
      description: The build must stay green
    - name: test_pass_rate
      check: ">= 1.0"
      description: Required tests must keep passing
  diagnostics:
    - name: artifact_size_mb
    - name: peak_memory_mb

measurement:
  command: "python evaluate.py"
  timeout_seconds: 300
  working_directory: "tools/eval"
  stability:
    mode: repeat
    repeat_count: 3
    aggregation: median
    noise_threshold: 0.05

scope:
  mutable:
    - "src/build/"
    - "config/build.yaml"
  immutable:
    - "tools/eval/evaluate.py"
    - "tests/fixtures/"
    - "scripts/ci/"

execution:
  mode: serial
  backend: worktree
  max_concurrent: 1

parallel:
  port_strategy: none
  shared_files: []

dependencies:
  approved: []

constraints:
  - "Keep output artifacts backward compatible"
  - "Do not skip required validation steps"

stopping:
  max_iterations: 4
  max_hours: 1
  plateau_iterations: 3
  target_reached: true

max_runner_up_merges_per_batch: 0
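
The degenerate gates in the template above pair a metric name with a check string such as "== 1" or ">= 1.0". A minimal sketch of how such checks could be evaluated follows. It assumes each check is an operator and a number separated by whitespace, and that a missing metric fails its gate. Neither rule is confirmed by the template, and the function names are hypothetical.

```python
import operator

# Comparison operators a check string may start with (assumed set).
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

# Gates copied from the improve-build-latency template.
GATES = [
    {"name": "build_passed", "check": "== 1"},
    {"name": "test_pass_rate", "check": ">= 1.0"},
]


def parse_check(check):
    """Split a check like '>= 1.0' into a comparison function and a threshold."""
    op_text, value_text = check.split()
    return _OPS[op_text], float(value_text)


def failed_gates(gates, metrics):
    """Return the names of gates whose metric is missing or fails its check."""
    failed = []
    for gate in gates:
        compare, threshold = parse_check(gate["check"])
        value = metrics.get(gate["name"])
        if value is None or not compare(value, threshold):
            failed.append(gate["name"])
    return failed


print(failed_gates(GATES, {"build_passed": 1, "test_pass_rate": 0.98}))
```

A candidate that fails any gate would be rejected regardless of how much it improves build_seconds, which is what keeps the primary metric from being gamed by a broken build.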


# Minimal first-run template for qualitative metrics.
# Start here when true quality requires semantic judgment, not a proxy metric.

name: improve-search-relevance
description: Improve semantic relevance of search results without obvious failures

metric:
  primary:
    type: judge
    name: mean_score
    direction: maximize
  degenerate_gates:
    - name: result_count
      check: ">= 5"
      description: Return enough results to judge quality
    - name: empty_query_failures
      check: "== 0"
      description: Empty or trivial queries must not fail
  diagnostics:
    - name: latency_ms
    - name: recall_at_10
  judge:
    rubric: |
      Rate each result set from 1-5 for relevance:
      - 5: Results are directly relevant and well ordered
      - 4: Mostly relevant with minor ordering issues
      - 3: Mixed relevance or one obvious miss
      - 2: Weak relevance, several misses, or poor ordering
      - 1: Mostly irrelevant
      Also report: ambiguous (boolean)
    scoring:
      primary: mean_score
      secondary:
        - ambiguous_rate
    model: haiku
    sample_size: 10
    batch_size: 5
    sample_seed: 42
    minimum_improvement: 0.2
    max_total_cost_usd: 5

measurement:
  command: "python eval_search.py"
  timeout_seconds: 300
  working_directory: "tools/eval"

scope:
  mutable:
    - "src/search/"
    - "config/search.yaml"
  immutable:
    - "tools/eval/eval_search.py"
    - "tests/fixtures/"
    - "docs/"

execution:
  mode: serial
  backend: worktree
  max_concurrent: 1

parallel:
  port_strategy: none
  shared_files: []

dependencies:
  approved: []

constraints:
  - "Preserve the existing search response shape"
  - "Do not add new dependencies on the first run"

stopping:
  max_iterations: 4
  max_hours: 1
  plateau_iterations: 3
  target_reached: true

max_runner_up_merges_per_batch: 0
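
Both templates share the same stopping block. The sketch below shows one plausible reading of it: plateau_iterations is taken to mean consecutive iterations without a new best, and target_reached: true is taken to mean "stop once a target value is hit". The templates do not define a target value, so the target parameter here is hypothetical, as are the function name and the order in which rules are checked.

```python
def stop_reason(iteration, elapsed_hours, iterations_since_best,
                best_value=None, target=None,
                max_iterations=4, max_hours=1, plateau_iterations=3,
                direction="minimize"):
    """Return the name of the first stopping rule that fires, or None to continue.

    Defaults mirror the stopping block in the templates; 'target' is an
    assumed extra input because the templates only carry a boolean flag.
    """
    if target is not None and best_value is not None:
        hit = best_value <= target if direction == "minimize" else best_value >= target
        if hit:
            return "target_reached"
    if iteration >= max_iterations:
        return "max_iterations"
    if elapsed_hours >= max_hours:
        return "max_hours"
    if iterations_since_best >= plateau_iterations:
        return "plateau"
    return None


# Third iteration, 12 minutes in, no new best for three iterations in a row.
print(stop_reason(iteration=3, elapsed_hours=0.2, iterations_since_best=3))
```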


# Experiment Log Schema
# This is the canonical schema for the experiment log file that accumulates
# across an optimization run.
#
# Location: .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
#
# PERSISTENCE MODEL:
# The experiment log on disk is the SINGLE SOURCE OF TRUTH. The agent's
# in-memory context is expendable and will be compacted during long runs.
#
# Write discipline:
# - Each experiment entry is APPENDED immediately after its measurement
#   completes (SKILL.md step 3.3), before batch evaluation
# - Outcome fields may be updated in-place after batch evaluation (step 3.5)
# - The `best` section is updated after each batch if a new best is found
# - The `hypothesis_backlog` is updated after each batch
# - The agent re-reads this file from disk at every phase boundary
#
# The or
