
Eval Driven Dev
Ground eval harnesses, traces, and datasets in real product context before you instrument code or run agent benchmarks.
Overview
Eval-driven-dev is an agent skill, most often used in the Ship phase (also Validate and Build), that analyzes what your software does in the real world before you choose entry points, eval criteria, traces, or datasets.
Install
npx skills add https://github.com/github/awesome-copilot --skill eval-driven-dev
What is this skill?
- Five-question project analysis frame: purpose, users, capability inventory, realistic inputs, and hard problems or failure modes
- README-first capability inventory so each mode can map to its own entry point, trace inputs, and dataset rows
- Plain-language product summary that downstream steps use for eval criteria and trace design
- Explicitly runs before code structure, entry points, or instrumentation
- Tailors “quality” to domain (customer chatbot vs multi-source research agent vs scraping library)
- Five core investigation questions in Step 1a project analysis
- Capability-inventory examples span scraping, voice agents, and research agents
Adoption & trust: 2.9k installs on skills.sh; 34.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are ready to instrument and benchmark your agent or app but have not written down who uses it, which capabilities matter, and what a successful run looks like in plain language.
Who is it for?
Solo builders adding evals, traces, or benchmarks to an existing agent, API, or multi-mode product with README or docs available.
Skip if: your team already has frozen eval specs and signed-off datasets and only needs a one-line code change with no scope review.
When should I use this skill?
Before choosing code entry points, instrumentation, eval criteria, trace inputs, or dataset entries for an agent or multi-mode app.
What do I get? / Deliverables
You get a one-paragraph product summary, user context, and a capability inventory that every later eval, trace, and dataset step can align to.
- One-paragraph plain-language product summary
- User and use-case notes that define what quality means
- Capability inventory mapping modes to future entry points and eval surfaces
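The deliverables above can also be held as structured data so later steps can query them. The following is a minimal sketch, not part of the skill itself: the class names, field names, and the `handle_greeting` entry point are hypothetical, and the capabilities are borrowed from the skill's voice-agent example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Capability:
    """One row of the capability inventory (field names are hypothetical)."""
    name: str
    description: str
    entry_point: Optional[str] = None  # assigned later, once code is inspected
    dataset_rows: list = field(default_factory=list)


@dataclass
class ProjectAnalysis:
    """Product summary, user context, and capability inventory in one place."""
    summary: str
    users: str
    capabilities: list

    def uninstrumented(self) -> list:
        """Names of capabilities that have no entry point assigned yet."""
        return [c.name for c in self.capabilities if c.entry_point is None]


analysis = ProjectAnalysis(
    summary="A voice agent that answers customer calls.",
    users="Support teams that want routine calls handled automatically.",
    capabilities=[
        Capability("greeting", "Open the call and identify the caller"),
        Capability("FAQ handling", "Answer common questions"),
        Capability("transfer to human", "Hand off when the agent cannot help"),
    ],
)

# Hypothetical function name, standing in for a real entry point found later.
analysis.capabilities[0].entry_point = "handle_greeting"

print(analysis.uninstrumented())  # ['FAQ handling', 'transfer to human']
```

Keeping the inventory in one object makes it easy to see which capabilities still lack an entry point before any instrumentation is written.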
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Eval-driven development is how solo builders prove agent quality before release; the canonical shelf is Ship → testing because outputs are eval criteria, traces, and datasets—not one-off feature code. Step 1a is project analysis that precedes instrumentation and test design, which is the same mental model as structured QA and eval design in the testing subphase.
Where it fits
List distinct agent modes (FAQ, transfer, summarization) so your MVP eval plan only covers capabilities you will ship.
Prioritize which entry points need traces first based on the capability inventory instead of instrumenting every file.
Turn user personas and success criteria into eval rubrics and dataset rows before running parallel agent runs.
Revisit the inventory when new features ship so production analytics and eval sets stay aligned with live capabilities.
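Keeping eval sets aligned with the inventory can be checked mechanically. The sketch below assumes a hypothetical dataset schema in which each row is a dict with a `capability` key; the function name and sample rows are illustrative only.

```python
def coverage_gaps(capabilities, dataset_rows):
    """Compare a capability inventory against dataset rows.

    capabilities: iterable of capability names from the inventory.
    dataset_rows: iterable of dicts with a "capability" key (assumed schema).
    Returns (capabilities with no rows, row capabilities not in the inventory).
    """
    inventory = set(capabilities)
    covered = {row["capability"] for row in dataset_rows}
    missing = sorted(inventory - covered)  # shipped but never evaluated
    stale = sorted(covered - inventory)    # evaluated but not in the inventory
    return missing, stale


inventory = ["FAQ", "transfer", "summarization"]
rows = [
    {"capability": "FAQ", "input": "What are your opening hours?"},
    {"capability": "transfer", "input": "I want to speak to a person."},
    {"capability": "account lookup", "input": "What is my balance?"},
]

missing, stale = coverage_gaps(inventory, rows)
print(missing)  # ['summarization']
print(stale)    # ['account lookup']
```

Running a check like this whenever a feature ships flags both capabilities without eval rows and rows that target capabilities no longer listed.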
How it compares
Use instead of jumping straight into pytest or ad-hoc agent chats when you need evaluation design tied to real product capabilities.
Common Questions / FAQ
Who is eval-driven-dev for?
Indie and solo developers shipping AI agents, copilots, or automation who need eval criteria grounded in how the product is used, not generic template tests.
When should I use eval-driven-dev?
During Validate when defining what to prove, during Build when planning agent entry points and observability, and during Ship before you write instrumentation, trace captures, or benchmark datasets.
Is eval-driven-dev safe to install?
It is procedural analysis guidance; review the Security Audits panel on this Prism page and treat repo reads like any other skill that inspects your codebase.
SKILL.md
# Step 1a: Project Analysis

Before looking at code structure, entry points, or writing any instrumentation, understand what this software does in the real world. This analysis is the foundation for every subsequent step: it determines which entry points to prioritize, what eval criteria to define, what trace inputs to use, and what dataset entries to build.

---

## What to investigate

Read the project's README, documentation, and top-level source files. You're looking for answers to five questions:

### 1. What does this software do?

Write a one-paragraph plain-language summary. What problem does it solve? What does a successful run look like?

### 2. Who uses it and why?

Who are the target users? What's the primary use case? What problem does this solve that alternatives don't?

This helps you understand what "quality" means for this app: a chatbot that chats with customers has different quality requirements than a research agent that synthesises multi-source reports.

### 3. Capability inventory

List the distinct capabilities, modes, or features the app offers. Be specific. For example:

- For a scraping library: single-page scraping, multi-page scraping, search-based scraping, speech output, script generation
- For a voice agent: greeting, FAQ handling, account lookup, transfer to human, call summarization
- For a research agent: topic research, multi-source synthesis, citation generation, report formatting

Each capability may need its own entry point, its own trace, and its own dataset entries. This list directly feeds Step 1c (use cases) and Step 4 (dataset diversity).

### 4. What are realistic inputs?

Characterize the real-world inputs the app processes, not toy examples:

- For a web scraper: "messy HTML pages with navigation, ads, dynamic content, tables, nested structures, typically 5KB-500KB of HTML"
- For a research agent: "open-ended research questions requiring multi-source synthesis, with 3-10 sub-questions"
- For a voice agent: "multi-turn conversations with background noise, interruptions, and ambiguous requests"

Be specific about **scale** (how large), **complexity** (how messy/diverse), and **variety** (what kinds). This directly feeds trace input selection (Step 2). If you don't characterize realistic inputs here, you'll end up using toy inputs that bypass the app's real logic.

**This section is an operational constraint, not just documentation.** Steps 2c (trace input) and 4c (dataset entries) will cross-reference these characteristics to verify that trace inputs and dataset entries match real-world scale and complexity. Be concrete and quantitative: write "5KB–500KB HTML pages," not "various HTML pages."

### 5. What are the hard problems / failure modes?

What makes this app's job difficult? Where does it fail in practice? These become the most valuable eval scenarios:

- For a scraper: "malformed HTML, dynamic JS-rendered content, complex nested schemas, very large pages that exceed context windows"
- For a research agent: "conflicting sources, questions requiring multi-step reasoning, hallucinating citations"
- For a voice agent: "ambiguous caller intent, account lookup failures, simultaneous tool calls"

Each failure mode should map to at least one eval criterion (Step 1c) and at least one dataset entry (Step 4).

---

## Output: `pixie_qa/00-project-analysis.md`

Write your findings to this file. **Complete all five sections before moving to sub-step 1b.** This document is referenced by every subsequent step.

### Template

```markdown
# Project Analysis

## What this software does

<One paragraph: what it does, in plain language. Not class names or file paths: what problem does it solve for its users?>

## Target users and value proposition

<Who uses it, why, what problem it solves that alternatives don't>

## Capability inventory

1. <Capability name>: <one-line description>
2. <Capability name>: <one-line description>
3. ...

## Realistic input characteristics

<What real-world in