
Ce Optimize
Run bounded, metric-driven optimization loops (build latency, relevance, etc.) with hard gates so agents improve code without breaking CI.
Overview
ce-optimize is an agent skill that drives bounded optimization loops against objective metrics, with CI-style gates and scoped file edits. It is most often used in Ship, and also fits Build (backend) and Operate (iterate).
Install
npx skills add https://github.com/everyinc/compound-engineering-plugin --skill ce-optimize
What is this skill?
- YAML recipe pattern: primary hard metric, degenerate gates (build_passed, test_pass_rate), and diagnostic side metrics
- Measurement harness with repeat runs, median aggregation, and noise threshold (e.g. 3 repeats, 0.05 noise)
- Explicit mutable vs immutable scope so eval scripts and fixtures stay trusted
- Stopping rules: max iterations, plateau detection, hours cap, and target_reached
- Serial worktree execution mode with max_concurrent for controlled agent edits
- Default max_iterations: 4 in latency template
- Measurement stability: repeat_count 3 with median aggregation
- Default noise_threshold: 0.05
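The measurement settings listed above (repeat_count 3, median aggregation, noise_threshold 0.05) can be sketched in Python as follows. This is a minimal illustration, not the skill's own harness: the function names are hypothetical, and it assumes noise_threshold is a relative fraction of the baseline, which the template does not state.

```python
import statistics
from typing import Callable


def aggregate_runs(measure: Callable[[], float], repeat_count: int = 3) -> float:
    """Call the measurement repeat_count times and return the median sample."""
    samples = [measure() for _ in range(repeat_count)]
    return statistics.median(samples)


def is_real_improvement(
    baseline: float,
    candidate: float,
    noise_threshold: float = 0.05,
    direction: str = "minimize",
) -> bool:
    """Return True only when the gain exceeds the noise band.

    Assumption: noise_threshold is interpreted as a fraction of the baseline.
    """
    if direction == "minimize":
        gain = baseline - candidate
    else:
        gain = candidate - baseline
    return gain > noise_threshold * abs(baseline)


if __name__ == "__main__":
    # Three fake build timings stand in for real runs of the measurement command.
    fake_timings = iter([41.0, 39.5, 40.2])
    median_seconds = aggregate_runs(lambda: next(fake_timings))
    print(median_seconds)  # 40.2
    # A 1.8 s gain on a 42 s baseline is inside the 5% band, so it is not counted.
    print(is_real_improvement(42.0, median_seconds))  # False
```

Taking the median rather than the mean keeps one slow outlier run from deciding whether a change is kept.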
Adoption & trust: 1.6k installs on skills.sh; 20.5k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want the agent to improve a build or quality metric, but every ad-hoc pass risks breaking tests or editing the eval infrastructure.
Who is it for?
Indie devs with an evaluate.py (or equivalent) harness optimizing build_seconds, artifact size, or similar hard metrics.
Skip if: Greenfield apps with no tests, no measurement command, or goals that need purely subjective judgment without an eval loop.
When should I use this skill?
Reduce build latency without regressing correctness, or improve another scalar metric defined with hard gates and a measurement command.
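To make "hard gates" concrete, here is a small Python sketch of how gate checks such as "== 1" and ">= 1.0" might be evaluated against measured values. The helper names and the "operator, space, number" grammar are assumptions inferred from the template examples; the skill's real parser may differ.

```python
import operator
from typing import Dict, List

# Comparison operators assumed from the template examples ("== 1", ">= 1.0").
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def check_gate(check: str, value: float) -> bool:
    """Evaluate a check string of the assumed form '<op> <number>'."""
    op_text, threshold_text = check.split()
    return _OPS[op_text](value, float(threshold_text))


def failed_gates(gates: List[Dict[str, str]], metrics: Dict[str, float]) -> List[str]:
    """Return the names of gates that fail or whose metric is missing."""
    failures = []
    for gate in gates:
        value = metrics.get(gate["name"])
        if value is None or not check_gate(gate["check"], value):
            failures.append(gate["name"])
    return failures


if __name__ == "__main__":
    gates = [
        {"name": "build_passed", "check": "== 1"},
        {"name": "test_pass_rate", "check": ">= 1.0"},
    ]
    # A faster build that drops a test must still be rejected.
    metrics = {"build_seconds": 31.2, "build_passed": 1, "test_pass_rate": 0.98}
    print(failed_gates(gates, metrics))  # ['test_pass_rate']
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a candidate that stops reporting test_pass_rate should not pass by omission.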
What do I get? / Deliverables
You get a repeatable optimization batch where only scoped paths change, hard gates stay green, and iteration stops on plateau or target. Runner-up patches are merged only when your template allows it.
- Optimized scoped source/config changes
- Measurement run logs under stability settings
- Stopped batch respecting plateau or target rules
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Ship/perf is the canonical shelf because the bundled templates optimize measurable delivery outcomes (build seconds, pass rates) right before or after release pressure. Performance and build-time tuning match the perf subphase; the skill is about iterative improvement under measurement, not greenfield feature coding.
Where it fits
Tune build.yaml and src/build/ while immutable eval fixtures enforce correctness.
Minimize build_seconds with repeat_count 3 median aggregation before a release branch.
Plateau-aware stopping after max_iterations when nightly builds creep upward.
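The stopping rules mentioned above (max_iterations, max_hours, plateau_iterations, target_reached) could combine roughly as in this Python sketch. The class, the function, and the order in which the rules are checked are assumptions for illustration; only the key names and default values come from the latency template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoppingRules:
    # Defaults mirror the improve-build-latency template.
    max_iterations: int = 4
    max_hours: float = 1.0
    plateau_iterations: int = 3


def stop_reason(
    rules: StoppingRules,
    iterations_done: int,
    hours_elapsed: float,
    iterations_since_best: int,
    target_hit: bool,
) -> Optional[str]:
    """Return the first rule that fires, or None to keep iterating.

    Assumption: the target is checked first, then the budgets, then the plateau.
    """
    if target_hit:
        return "target_reached"
    if iterations_done >= rules.max_iterations:
        return "max_iterations"
    if hours_elapsed >= rules.max_hours:
        return "max_hours"
    if iterations_since_best >= rules.plateau_iterations:
        return "plateau"
    return None


if __name__ == "__main__":
    rules = StoppingRules()
    print(stop_reason(rules, 2, 0.3, 1, False))  # None
    print(stop_reason(rules, 3, 0.5, 3, False))  # plateau
    print(stop_reason(rules, 4, 0.5, 1, False))  # max_iterations
```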
How it compares
A metric-gated agent optimization recipe, not a one-off “refactor for speed” chat or a hosted A/B platform.
Common Questions / FAQ
Who is ce-optimize for?
Solo builders using compound-engineering-style agents who already have tests and a measurement script and want structured improve-build-latency-style runs.
When should I use ce-optimize?
Use it in Ship perf when build time blocks releases, in Build backend when tuning build pipelines under test gates, or in Operate iterate when tightening tooling loops. It applies as soon as you have a primary metric and degenerate gates defined in YAML.
Is ce-optimize safe to install?
Installing it implies shell execution and repo edits within the mutable scope. Review the Security Audits on this page and lock immutable paths (eval scripts, fixtures, CI scripts) before trusting automated merges.
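One way to verify that a candidate patch respects the mutable and immutable scope before merging is a path check like the Python sketch below. The function names are hypothetical, and treating each scope entry as a path prefix is an assumption; the scope lists are taken from the latency template.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def _covers(rule: str, path: str) -> bool:
    """True when path equals the rule or sits underneath it (prefix by path segment)."""
    rule_path = PurePosixPath(rule)
    target = PurePosixPath(path)
    return target == rule_path or rule_path in target.parents


def scope_violations(
    changed: List[str], mutable: List[str], immutable: List[str]
) -> List[Tuple[str, str]]:
    """Return (path, reason) for every changed file that breaks the scope."""
    violations = []
    for path in changed:
        if any(_covers(rule, path) for rule in immutable):
            violations.append((path, "immutable"))
        elif not any(_covers(rule, path) for rule in mutable):
            violations.append((path, "out of scope"))
    return violations


if __name__ == "__main__":
    mutable = ["src/build/", "config/build.yaml"]
    immutable = ["tools/eval/evaluate.py", "tests/fixtures/", "scripts/ci/"]
    changed = ["src/build/cache.py", "tools/eval/evaluate.py", "README.md"]
    print(scope_violations(changed, mutable, immutable))
    # [('tools/eval/evaluate.py', 'immutable'), ('README.md', 'out of scope')]
```

Matching on whole path segments means a directory such as src/build_extra/ is not accidentally covered by the src/build/ rule.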
SKILL.md
# Minimal first-run template for objective metrics.
# Start here when "better" is a scalar value from the measurement harness.
name: improve-build-latency
description: Reduce build latency without regressing correctness

metric:
  primary:
    type: hard
    name: build_seconds
    direction: minimize
  degenerate_gates:
    - name: build_passed
      check: "== 1"
      description: The build must stay green
    - name: test_pass_rate
      check: ">= 1.0"
      description: Required tests must keep passing
  diagnostics:
    - name: artifact_size_mb
    - name: peak_memory_mb

measurement:
  command: "python evaluate.py"
  timeout_seconds: 300
  working_directory: "tools/eval"
  stability:
    mode: repeat
    repeat_count: 3
    aggregation: median
    noise_threshold: 0.05

scope:
  mutable:
    - "src/build/"
    - "config/build.yaml"
  immutable:
    - "tools/eval/evaluate.py"
    - "tests/fixtures/"
    - "scripts/ci/"

execution:
  mode: serial
  backend: worktree
  max_concurrent: 1

parallel:
  port_strategy: none
  shared_files: []

dependencies:
  approved: []

constraints:
  - "Keep output artifacts backward compatible"
  - "Do not skip required validation steps"

stopping:
  max_iterations: 4
  max_hours: 1
  plateau_iterations: 3
  target_reached: true

max_runner_up_merges_per_batch: 0

# Minimal first-run template for qualitative metrics.
# Start here when true quality requires semantic judgment, not a proxy metric.
name: improve-search-relevance
description: Improve semantic relevance of search results without obvious failures

metric:
  primary:
    type: judge
    name: mean_score
    direction: maximize
  degenerate_gates:
    - name: result_count
      check: ">= 5"
      description: Return enough results to judge quality
    - name: empty_query_failures
      check: "== 0"
      description: Empty or trivial queries must not fail
  diagnostics:
    - name: latency_ms
    - name: recall_at_10
  judge:
    rubric: |
      Rate each result set from 1-5 for relevance:
      - 5: Results are directly relevant and well ordered
      - 4: Mostly relevant with minor ordering issues
      - 3: Mixed relevance or one obvious miss
      - 2: Weak relevance, several misses, or poor ordering
      - 1: Mostly irrelevant
      Also report: ambiguous (boolean)
    scoring:
      primary: mean_score
      secondary:
        - ambiguous_rate
    model: haiku
    sample_size: 10
    batch_size: 5
    sample_seed: 42
    minimum_improvement: 0.2
    max_total_cost_usd: 5

measurement:
  command: "python eval_search.py"
  timeout_seconds: 300
  working_directory: "tools/eval"

scope:
  mutable:
    - "src/search/"
    - "config/search.yaml"
  immutable:
    - "tools/eval/eval_search.py"
    - "tests/fixtures/"
    - "docs/"

execution:
  mode: serial
  backend: worktree
  max_concurrent: 1

parallel:
  port_strategy: none
  shared_files: []

dependencies:
  approved: []

constraints:
  - "Preserve the existing search response shape"
  - "Do not add new dependencies on the first run"

stopping:
  max_iterations: 4
  max_hours: 1
  plateau_iterations: 3
  target_reached: true

max_runner_up_merges_per_batch: 0

# Experiment Log Schema
# This is the canonical schema for the experiment log file that accumulates
# across an optimization run.
#
# Location: .context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml
#
# PERSISTENCE MODEL:
# The experiment log on disk is the SINGLE SOURCE OF TRUTH. The agent's
# in-memory context is expendable and will be compacted during long runs.
#
# Write discipline:
# - Each experiment entry is APPENDED immediately after its measurement
#   completes (SKILL.md step 3.3), before batch evaluation
# - Outcome fields may be updated in-place after batch evaluation (step 3.5)
# - The `best` section is updated after each batch if a new best is found
# - The `hypothesis_backlog` is updated after each batch
# - The agent re-reads this file from disk at every phase boundary
#
# The or