Eval Driven Dev

Name: Eval Driven Dev
Author: github

github/awesome-copilot

3.6k installs
37.1k repo stars
Updated July 28, 2026

eval-driven-dev is a skill that builds a pixie-qa evaluation pipeline testing Python LLM apps end-to-end with real model calls and scored evaluators.

About

eval-driven-dev implements an automated evaluation pipeline for Python LLM applications using pixie-qa and pixie test. Invoke it when users ask to add evals, benchmark LLM behavior, set up QA, or fix wrong outputs in Python projects that call real models. The workflow runs six sequential steps: understand the app and define eval criteria; instrument data boundaries with wrap, implement a Runnable in pixie_qa/run_app.py, and capture a reference trace; define evaluators; build a golden dataset; run pixie test; and analyze outcomes into action plans. Critical rules forbid mocking the LLM in eval Runnable code because that makes scores tautological; external data sources are injected via instrumentations while app routing and prompt assembly run for real. Setup activates a virtual environment and runs resources/setup.sh, which installs pixie-qa and runs pixie init and pixie start for the results web UI. Deliverables include pixie_qa markdown context files, reference-trace.jsonl, evaluator mapping, dataset JSON, scored test results, and per-dataset analysis with action-plan summaries. Compatibility requires Python 3.10 or newer and pixie-qa version 0.8.4 or newer.

Six-step pixie-qa workflow from project analysis through pixie test execution and outcome analysis.
Requires real LLM calls in the Runnable; forbids mocking or stubbing the model in eval harness code.
Instruments app data boundaries with wrap() while external sources receive test-controlled inputs.
Produces pixie_qa context files, reference traces, datasets, evaluator mappings, and action plans.
Setup script installs pixie-qa, runs pixie init, and starts a background web server for result review.

Eval Driven Dev by the numbers

3,571 all-time installs (skills.sh)
+125 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #304 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

At a glance

eval-driven-dev capabilities & compatibility

Capabilities: project and entry point analysis · wrap() boundary instrumentation · runnable harness implementation · evaluator and dataset authoring · pixie test execution and outcome analysis
Use cases: testing · debugging

From the docs

What eval-driven-dev says it does

compatibility: Python 3.10+

SKILL.md

npx skills add https://github.com/github/awesome-copilot --skill eval-driven-dev


Security audit	2 / 3 scanners passed

How do I add LLM evals to a Python app with instrumented traces, golden datasets, and pass-fail pixie test scores?

Build a pixie-qa evaluation pipeline for Python LLM apps with real LLM calls, instrumented traces, and scored test runs.

Who is it for?

Python teams shipping LLM features who need evaluation-driven QA with pixie-qa instead of mocked unit tests.

Skip if: the project is not Python, you only need pure unit tests without LLM behavior scoring, or the LLM must be faked.

When should I use this skill?

User asks to add evals, benchmark LLM quality, set up QA for Python LLM apps, or run pixie test on application behavior.

What you get

A working pixie test run with evaluator scores, completed pending evaluations, dataset analysis, and a prioritized action plan.

product purpose summary
eval criteria definitions
trace input and dataset specs

By the numbers

Step 1a investigates five questions about purpose, users, capabilities, realistic inputs, and failure modes
Foundation step determines eval criteria, trace inputs, and dataset entries

Files

SKILL.md

Eval-Driven Development for Python LLM Applications

You're building an automated evaluation pipeline that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via pixie test.

What you're testing is the app itself — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of assertEqual — but the thing under test is the app's code, not the LLM.
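To make the contrast with assertEqual concrete, here is a minimal sketch of a similarity-style check using only the standard library. It is not a pixie-qa evaluator: the function names and the 0.7 threshold are assumptions chosen for illustration, and a character-level ratio is only a rough proxy for the semantic scoring an LLM-as-judge evaluator would provide.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Return a 0.0-1.0 character-level similarity ratio between two strings."""
    return SequenceMatcher(None, expected, actual).ratio()


def passes(expected: str, actual: str, threshold: float = 0.7) -> bool:
    """Pass when the output is close enough to the reference, not only when identical.

    The 0.7 threshold is an illustrative assumption, not a recommended value.
    """
    return similarity(expected, actual) >= threshold


reference = "Your order ships on Monday."
model_output = "Your order will ship on Monday."

# Strict equality rejects a harmless rewording of the same answer.
exact_match = reference == model_output

# A scored comparison tolerates the rewording.
close_enough = passes(reference, model_output)
```

A scored check like this degrades gradually as outputs drift, which suits non-deterministic output better than an exact comparison that fails on the first changed character.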

During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.

Rule: The app's LLM calls must go to a real LLM. Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable.

The deliverable is a working `pixie test` run with real scores — not a plan, not just instrumentation, not just a dataset.

This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.

---

Before you start

First, activate the virtual environment. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run the setup.sh included in the skill's resources. The script updates the eval-driven-dev skill and the pixie-qa Python package to the latest version, initializes the pixie working directory if it is not already initialized, and starts a web server in the background to show the user updates.

Setup error handling — what you can skip vs. what must succeed:

Skill update fails → OK to continue. The existing skill version is sufficient.
pixie-qa upgrade fails but was already installed → OK to continue with the existing version.
pixie-qa is NOT installed and installation fails → STOP. Ask the user for help. The workflow cannot proceed without the pixie package.
`pixie init` fails → STOP. Ask the user for help.
`pixie start` (web server) fails → STOP. Ask the user for help. Check server.log in the pixie root directory for diagnostics. Common causes: port conflict, missing dependency, slow environment. Do NOT proceed without the web server — the user needs it to see eval results.
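The rules above amount to a small decision table. The sketch below encodes them as a function; the stage labels (`skill_update`, `pixie_qa_install`, `pixie_init`, `pixie_start`) are hypothetical names invented for this example and do not correspond to anything setup.sh prints.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    STOP = "stop and ask the user"


def on_setup_failure(stage: str, pixie_already_installed: bool = False) -> Action:
    """Map a failed setup stage to the action the rules above prescribe.

    Stage names are illustrative labels, not output of the setup script.
    """
    if stage == "skill_update":
        # The existing skill version is sufficient.
        return Action.CONTINUE
    if stage == "pixie_qa_install":
        # A failed upgrade is tolerable only when a version is already present.
        return Action.CONTINUE if pixie_already_installed else Action.STOP
    if stage in ("pixie_init", "pixie_start"):
        # Neither the working directory nor the web server is optional.
        return Action.STOP
    raise ValueError(f"unknown setup stage: {stage}")
```

Raising on an unknown stage keeps an unrecognized failure from being silently treated as safe to ignore.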

---

The workflow

Follow Steps 1–6 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.

How to work — read this before doing anything else:

One step at a time. Read only the current step's instructions. Do NOT read Steps 2–6 while working on Step 1.
Read references only when a step tells you to. Each step names a specific reference file. Read it when you reach that step — not before.
Create artifacts immediately. After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.
Verify, then move on. Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.

When to stop and ask for help:

Some blockers cannot and should not be worked around. When you encounter any of the following, stop immediately and ask the user for help — do not attempt workarounds:

Application won't run due to missing environment variables or configuration: The app requires environment variables or configuration that are not set and cannot be inferred. Do NOT work around this by mocking, faking, or replacing application components — the eval must exercise real production code. Ask the user to fix the environment setup.
App import failures that indicate a broken project: If the app's core modules cannot be imported due to missing system dependencies or incompatible Python versions (not just missing pip packages you can install), ask the user to fix the project setup.
Ambiguous entry point: If the app has multiple equally plausible entry points and the project analysis doesn't clarify which one matters most, ask the user which to target.

Blockers you SHOULD resolve yourself (do not ask): missing Python packages (install them), missing pixie package (install it), port conflicts (pick a different port), file permission issues (fix them).

Run Steps 1–6 in sequence. If the user's prompt makes it clear that earlier steps are already done (e.g., "run the existing tests", "re-run evals"), skip to the appropriate step. When in doubt, start from Step 1.

---

Step 1: Understand the app and define eval criteria

First, check the user's prompt for specific requirements. Before reading app code, examine what the user asked for:

Referenced documents or specs: Does the prompt mention a file to follow (e.g., "follow the spec in EVAL_SPEC.md", "use the methodology in REQUIREMENTS.md")? If so, read that file first — it may specify datasets, evaluation dimensions, pass criteria, or methodology that override your defaults.
Specified datasets or data sources: Does the prompt reference specific data files (e.g., "use questions from eval_inputs/research_questions.json", "use the scenarios in call_scenarios.json")? If so, read those files — you must use them as the basis for your eval dataset, not fabricate generic alternatives.
Specified evaluation dimensions: Does the prompt name specific quality aspects to evaluate (e.g., "evaluate on factuality, completeness, and bias", "test identity verification and tool call correctness")? If so, every named dimension must have a corresponding evaluator in your test file.

If the prompt specifies any of the above, they take priority. Read and incorporate them before proceeding.

Step 1 has three sub-steps. Each reads its own reference file and produces its own output file. Complete each sub-step fully before starting the next.

Sub-step 1a: Project analysis

Reference: Read references/1-a-project-analysis.md now.

Before looking at code structure or entry points, understand what this software does in the real world — its purpose, its users, the complexity of real inputs, and where it fails. This understanding drives every downstream decision: which entry points matter most, what eval criteria to define, what trace inputs to use, and what dataset entries to create. Write the detailed context file before moving on. Note: the project may contain tests/, fixtures/, examples/, mock servers, and documentation — these are the project's own development infrastructure, NOT data sources for your eval pipeline. Ignore them when sourcing trace inputs and dataset content.

Checkpoint: pixie_qa/00-project-analysis.md written — covering what the software does, target users, capability inventory (at least 3 capabilities if the project has them), realistic input characteristics, and hard problems / failure modes (at least 2).

Sub-step 1b: Entry point & execution flow

Reference: Read references/1-b-entry-point.md now.

Read the source code to understand how the app starts and how a real user invokes it. Use the capability inventory from pixie_qa/00-project-analysis.md to prioritize entry points — focus on the entry point(s) that exercise the most valuable capabilities, not just the first one found. Write the detailed context file before moving on.

Checkpoint: pixie_qa/01-entry-point.md written — covering entry point, execution flow, user-facing interface, and env requirements.

Sub-step 1c: Eval criteria

Reference: Read references/1-c-eval-criteria.md now.

Define the app's use cases and eval criteria. Derive use cases from the capability inventory in pixie_qa/00-project-analysis.md. Derive eval criteria from the hard problems / failure modes — not generic quality dimensions. Use cases drive dataset creation (Step 4); eval criteria drive evaluator selection (Step 3). Write the detailed context file before moving on.

Checkpoint: pixie_qa/02-eval-criteria.md written — covering use cases, eval criteria, and their applicability scope. Do NOT read Step 2 instructions yet.

---

Step 2: Instrument, run application, and capture a reference trace

Step 2 has three sub-steps. Each reads its own reference file. Complete each sub-step before starting the next.

Sub-step 2a: Instrument with `wrap`

Reference: Read references/2a-instrumentation.md now.

Add wrap() calls at the app's data boundaries so the eval harness can inject controlled inputs and capture outputs. This makes the app testable without changing its logic.

Checkpoint: wrap() calls added at all data boundaries. Every eval criterion from pixie_qa/02-eval-criteria.md has a corresponding data point.

Sub-step 2b: Implement the Runnable

Reference: Read references/2b-implement-runnable.md now.

Write a Runnable class that lets the eval harness invoke the app exactly as a real user would. The Runnable should be simple — it just wires up the app's real entry point to the harness interface. If it's getting complicated, something is wrong.

Checkpoint: pixie_qa/run_app.py written. The Runnable calls the app's real entry point with real LLM configuration — no mocking, no faking, no component replacement.

Sub-step 2c: Capture and verify a reference trace

Reference: Read references/2c-capture-and-verify-trace.md now.

Run the app through the Runnable and capture a trace. The trace proves instrumentation and the Runnable are working correctly, and provides the data shapes needed for dataset creation in Step 4.

Checkpoint: pixie_qa/reference-trace.jsonl exists. All expected wrap entries and llm_span entries appear. pixie format shows all data points needed for evaluation. Do NOT read Step 3 instructions yet.

---

Step 3: Define evaluators

Reference: Read references/3-define-evaluators.md now for the detailed sub-steps.

Goal: Turn the qualitative eval criteria from Step 1c into concrete, runnable scoring functions. Each criterion maps to either a built-in evaluator, an agent evaluator (the default for any semantic or qualitative criterion), or a manual custom function (only for mechanical/deterministic checks like regex or field existence). The evaluator mapping artifact bridges between criteria and the dataset, ensuring every quality dimension has a scorer. Select evaluators that measure the hard problems identified in pixie_qa/00-project-analysis.md — not just generic quality dimensions.

Checkpoint: All evaluators implemented. pixie_qa/03-evaluator-mapping.md written with criterion-to-evaluator mapping and decision rationale. Do NOT read Step 4 instructions yet.

---

Step 4: Build the dataset

Reference: Read references/4-build-dataset.md now for the detailed sub-steps.

Goal: Create the test scenarios that tie everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c). Each dataset entry defines what to send to the app, what data the app should see from external services, and how to score the result. Use the reference trace from Step 2 as the source of truth for data shapes and field names. Cover entries from the capability inventory in pixie_qa/00-project-analysis.md and include entries targeting the failure modes identified there. Do NOT use the project's own test fixtures, mock servers, or example data as dataset `eval_input` content — source real-world data instead. Every `wrap(purpose="input")` in the app must have pre-captured content in each entry's `eval_input` — do NOT leave eval_input empty when the app has input wraps.

Checkpoint: Dataset JSON created at pixie_qa/datasets/<name>.json with diverse entries covering all use cases. Dataset realism audit passed: entries use real-world data at representative scale, there is no contamination from project test fixtures, at least one entry targets a failure mode with uncertain outcome, and every eval_input has captured content for all input wraps. Do NOT read Step 5 instructions yet.
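The last part of that audit (every input wrap has captured content) is mechanical enough to sketch. This example assumes each entry's `eval_input` has already been loaded into a plain dict keyed by wrap name; that shape is an assumption made for illustration and may not match the actual dataset JSON layout, which is defined in references/4-build-dataset.md. The wrap names are hypothetical.

```python
def missing_input_wraps(required_names: set[str], eval_input: dict) -> list[str]:
    """Return the input wrap names that have no captured content in this entry.

    `eval_input` is assumed to be a dict keyed by wrap name (illustrative shape only).
    """
    empty_values = (None, "", [], {})
    return sorted(
        name for name in required_names
        if eval_input.get(name) in empty_values
    )


# Hypothetical wrap names for an app that reads a customer record and a knowledge base.
required = {"customer_record", "kb_articles"}
entry_inputs = {"customer_record": {"id": "c-1", "plan": "basic"}, "kb_articles": []}

gaps = missing_input_wraps(required, entry_inputs)
```

An empty list means the entry is complete. Any name returned points at an input wrap the app would read with nothing supplied.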

---

Step 5: Run `pixie test` and fix mechanical issues

Reference: Read references/5-run-tests.md now for the detailed sub-steps.

Goal: Execute the full pipeline end-to-end and get it running without mechanical errors. This step is strictly about fixing setup and data issues in the pixie QA components (dataset, runnable, custom evaluators) — NOT about fixing the application itself or evaluating result quality. Once pixie test completes without errors and produces real evaluator scores for every entry, this step is done.

Checkpoint: pixie test runs to completion. Every dataset entry has evaluator scores (real EvaluationResult or PendingEvaluation). No setup errors, no import failures, no data validation errors.

If the test errors out, that's a mechanical bug in your QA components — fix and re-run. But once tests produce scores, move on. Do NOT assess result quality here — that's Step 6.

Always proceed to Step 6 after tests produce scores. Analysis is the essential final step — without it, pending evaluations are never completed and the user gets uninterpreted raw scores with no actionable insights. Do NOT stop here and ask the user whether to continue.

Cycle rule for iterative runs: Every successful pixie test invocation creates a concrete pixie_qa/results/<test_id> directory and starts a new analysis cycle. Before you edit application code, prompts, datasets, evaluators, or rerun pixie test, complete Step 6 for that exact results directory. Do not skip earlier cycles and analyze only the last run.

---

Step 6: Analyze outcomes

Reference: Read references/6-analyze-outcomes.md now — it has the complete three-phase analysis process, writing guidelines, and output format requirements.

Goal: Analyze pixie test results in a structured, data-driven process to produce actionable insights on test case quality, evaluator quality, and application quality. This step completes pending evaluations, writes per-entry and per-dataset analysis, and produces a prioritized action plan. Every statement must be backed by concrete data from the evaluation run — no speculation, no hand-waving.

Persisted analysis artifacts: In this trimmed workflow, persist analysis only at the dataset level and test-run level. Those artifacts still use a detailed version (for agent consumption: data points, evidence trails, reasoning chains) plus a summary version (for human review: concise TLDR readable in under 2 minutes). Do not create per-entry analysis files.

Hard completion gate: Step 6 is not complete until all of the following are true:

Every "status": "pending" entry in every pixie_qa/results/<test_id>/dataset-*/entry-*/evaluations.jsonl has been replaced with a scored result containing score and reasoning.
Every dataset directory has analysis.md and analysis-summary.md.
The test run root has action-plan.md and action-plan-summary.md.
You have run the Step 6 verifier script from this skill's resources/ directory against pixie_qa/results/<test_id>, and it reports success.
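For intuition about what the gate checks, here is a rough sketch of the first three conditions. It is not the verifier script shipped in the skill's resources/ directory and does not replace running it. It assumes each line of evaluations.jsonl is a JSON object with top-level `status`, `score`, and `reasoning` keys, which is inferred from the wording above rather than confirmed.

```python
import json
from pathlib import Path


def completion_problems(results_dir: Path) -> list[str]:
    """List unmet Step 6 conditions under one pixie_qa/results/<test_id> directory."""
    problems: list[str] = []

    # Condition 1: no pending evaluations, and every record carries score and reasoning.
    for path in sorted(results_dir.glob("dataset-*/entry-*/evaluations.jsonl")):
        for line_no, line in enumerate(path.read_text().splitlines(), start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            if record.get("status") == "pending":
                problems.append(f"{path}:{line_no} is still pending")
            elif "score" not in record or "reasoning" not in record:
                problems.append(f"{path}:{line_no} lacks score or reasoning")

    # Condition 2: every dataset directory has both analysis files.
    for dataset_dir in sorted(results_dir.glob("dataset-*")):
        if not dataset_dir.is_dir():
            continue
        for name in ("analysis.md", "analysis-summary.md"):
            if not (dataset_dir / name).is_file():
                problems.append(f"{dataset_dir} is missing {name}")

    # Condition 3: the test run root has both action-plan files.
    for name in ("action-plan.md", "action-plan-summary.md"):
        if not (results_dir / name).is_file():
            problems.append(f"{results_dir} is missing {name}")

    return problems
```

An empty return value means the three file-level conditions hold for that results directory. The fourth condition still requires the skill's own verifier to report success.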

Explicitly not sufficient:

Writing a single top-level file such as pixie_qa/06-analysis.md
Saying pending evaluations are for the user to review in the web UI
Saying an entry "likely passes" without updating evaluations.jsonl

---

Web Server Management

pixie-qa runs a web server in the background for displaying context, traces, and eval results to the user. It's automatically started by the setup script (via pixie start, which launches a detached background process and returns immediately).

When the user is done with the eval-driven-dev workflow, inform them that the web server is still running and that it can be stopped with:

pixie stop

IMPORTANT: after the web server is stopped, the web UI becomes inaccessible. So only stop the server if the user confirms they're done with all web UI features. If they want to keep using the web UI, do NOT stop the server.

Whenever you restart the workflow, always run the setup.sh script in resources again to ensure the web server is running.

Step 1a: Project Analysis

Before looking at code structure, entry points, or writing any instrumentation, understand what this software does in the real world. This analysis is the foundation for every subsequent step — it determines which entry points to prioritize, what eval criteria to define, what trace inputs to use, and what dataset entries to build.

---

What to investigate

Read the project's README, documentation, and top-level source files. You're looking for answers to five questions:

1. What does this software do?

Write a one-paragraph plain-language summary. What problem does it solve? What does a successful run look like?

2. Who uses it and why?

Who are the target users? What's the primary use case? What problem does this solve that alternatives don't? This helps you understand what "quality" means for this app — a chatbot that chats with customers has different quality requirements than a research agent that synthesises multi-source reports.

3. Capability inventory

List the distinct capabilities, modes, or features the app offers. Be specific. For example:

For a scraping library: single-page scraping, multi-page scraping, search-based scraping, speech output, script generation
For a voice agent: greeting, FAQ handling, account lookup, transfer to human, call summarization
For a research agent: topic research, multi-source synthesis, citation generation, report formatting

Each capability may need its own entry point, its own trace, and its own dataset entries. This list directly feeds Step 1c (use cases) and Step 4 (dataset diversity).

4. What are realistic inputs?

Characterize the real-world inputs the app processes — not toy examples:

For a web scraper: "messy HTML pages with navigation, ads, dynamic content, tables, nested structures — typically 5KB-500KB of HTML"
For a research agent: "open-ended research questions requiring multi-source synthesis, with 3-10 sub-questions"
For a voice agent: "multi-turn conversations with background noise, interruptions, and ambiguous requests"

Be specific about scale (how large), complexity (how messy/diverse), and variety (what kinds). This directly feeds trace input selection (Step 2) — if you don't characterize realistic inputs here, you'll end up using toy inputs that bypass the app's real logic.

This section is an operational constraint, not just documentation. Steps 2c (trace input) and 4c (dataset entries) will cross-reference these characteristics to verify that trace inputs and dataset entries match real-world scale and complexity. Be concrete and quantitative — write "5KB–500KB HTML pages," not "various HTML pages."
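Because the characteristics are quantitative, a later step can test a candidate input against them. The sketch below uses the 5KB–500KB web scraper range from the example above as default bounds; the function name is hypothetical, and a real check would read the bounds from what you wrote in the analysis file.

```python
def within_documented_scale(payload: bytes, min_kb: int = 5, max_kb: int = 500) -> bool:
    """Return True when the payload size falls inside the documented realistic range."""
    size_kb = len(payload) / 1024
    return min_kb <= size_kb <= max_kb


toy_page = b"<html><body><p>Hello</p></body></html>"
realistic_page = b"<div>" + b"x" * (50 * 1024) + b"</div>"

toy_ok = within_documented_scale(toy_page)
realistic_ok = within_documented_scale(realistic_page)
```

Size is only one of the three dimensions named above. Complexity and variety still need a human or agent judgment, but a size check catches the most common failure, a toy input.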

5. What are the hard problems / failure modes?

What makes this app's job difficult? Where does it fail in practice? These become the most valuable eval scenarios:

For a scraper: "malformed HTML, dynamic JS-rendered content, complex nested schemas, very large pages that exceed context windows"
For a research agent: "conflicting sources, questions requiring multi-step reasoning, hallucinating citations"
For a voice agent: "ambiguous caller intent, account lookup failures, simultaneous tool calls"

Each failure mode should map to at least one eval criterion (Step 1c) and at least one dataset entry (Step 4).

---

Output: `pixie_qa/00-project-analysis.md`

Write your findings to this file. Complete all five sections before moving to sub-step 1b. This document is referenced by every subsequent step.

Template

# Project Analysis

## What this software does

<One paragraph: what it does, in plain language. Not class names or file paths — what problem does it solve for its users?>

## Target users and value proposition

<Who uses it, why, what problem it solves that alternatives don't>

## Capability inventory

1. <Capability name>: <one-line description>
2. <Capability name>: <one-line description>
3. ...

## Realistic input characteristics

<What real-world inputs look like — size, complexity, messiness, variety. Be specific about scale and structure.>

## Hard problems and failure modes

1. <Failure mode>: <why it's hard, what goes wrong>
2. <Failure mode>: <why it's hard, what goes wrong>
3. ...

Quality check

Before moving on, verify:

The "What this software does" section describes the app's purpose in terms a non-technical user would understand — not just "it runs a graph" or "it calls OpenAI"
The capability inventory lists at least 3 capabilities (if the project has them) — if you only found 1, you may have only looked at one part of the codebase
The realistic input characteristics describe real-world scale and complexity, not the simplest possible input
The failure modes are specific to this app's domain, not generic ("bad input" is not a failure mode; "malformed HTML with unclosed tags that breaks the parser" is)

What to ignore in the project

The project may contain directories and files that are part of its own development/test infrastructure — tests/, fixtures/, examples/, mock_server/, docs/, demo scripts, etc. These exist for the project's developers, not for your eval pipeline.

Critical: Do NOT use the project's test fixtures, mock servers, example data, or unit test infrastructure as inputs for your eval traces or dataset entries. They are designed for development speed and isolation — small, clean, deterministic data that bypasses every real-world difficulty. Using them produces trivially easy evaluations that cannot catch real quality issues.

When you encounter these directories during analysis, note their existence but treat them as implementation details of the project — not as data sources for your QA pipeline. Your QA pipeline must test the app against real-world conditions, not against the project's own test shortcuts.

Step 1b: Entry Point & Execution Flow

Identify how the application starts and how a real user invokes it. Use the capability inventory from pixie_qa/00-project-analysis.md to prioritize — focus on the entry point(s) that exercise the most valuable and frequently-used capabilities, not just the first one you find.

---

What to investigate

1. How the software runs

What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?

Look for:

if __name__ == "__main__" blocks
Framework entry points (FastAPI app, Flask app, Django manage.py)
CLI entry points in pyproject.toml ([project.scripts])
Docker/compose configs that reveal startup commands
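The first item in that list can be located programmatically. This sketch uses the standard `ast` module to detect a top-level `if __name__ == "__main__"` guard in a source string. It covers only that one pattern in its most common spelling; framework apps, `[project.scripts]` entries, and container configs still need to be read by hand.

```python
import ast


def has_main_guard(source: str) -> bool:
    """Return True if the module has a top-level `if __name__ == "__main__":` block."""
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, ast.If):
            continue
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and isinstance(test.left, ast.Name)
            and test.left.id == "__name__"
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            and len(test.comparators) == 1
            and isinstance(test.comparators[0], ast.Constant)
            and test.comparators[0].value == "__main__"
        ):
            return True
    return False


script = 'def main():\n    pass\n\nif __name__ == "__main__":\n    main()\n'
library = "def helper():\n    return 1\n"

script_is_entry = has_main_guard(script)
library_is_entry = has_main_guard(library)
```

Parsing instead of text-searching avoids false positives from comments or docstrings that merely mention the guard.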

2. The real user entry point

How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.

Web server: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
CLI: What command-line arguments does the user provide?
Library/function: What function does the caller import and call? What arguments?

3. Environment and configuration

What env vars does the app require? (service endpoints, database URLs, feature flags)
What config files does it read?
What has sensible defaults vs. what must be explicitly set?
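Once the required variables are known, checking for them before the first run is straightforward. The sketch below reports which required names are unset or empty; the variable names in the usage lines are hypothetical examples, not requirements of any particular app. A non-empty result is one of the blockers where the workflow says to stop and ask the user.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_env_vars(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Hypothetical requirements for illustration; pass a mapping to keep the check testable.
required = ["DATABASE_URL", "LLM_API_KEY", "FEATURE_FLAGS"]
fake_env = {"DATABASE_URL": "postgres://localhost/app", "FEATURE_FLAGS": ""}

missing = missing_env_vars(required, fake_env)
```

Treating an empty string as missing matches how most apps behave, since an empty credential fails the same way an absent one does.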

---

Output: `pixie_qa/01-entry-point.md`

Write your findings to this file. Keep it focused — only entry point and execution flow.

Template

# Entry Point & Execution Flow

## How to run

<Command to start the app, required env vars, config files>

## Entry point

- **File**: <e.g., app.py, main.py>
- **Type**: <FastAPI server / CLI / standalone function / etc.>
- **Framework**: <FastAPI, Flask, Django, none>

## User-facing endpoints / interface

<For each way a user interacts with the app:>

- **Endpoint / command**: <e.g., POST /chat, python main.py --query "...">
- **Input format**: <request body shape, CLI args, function params>
- **Output format**: <response shape, stdout format, return type>

## Environment requirements

| Variable | Purpose | Required? | Default |
| -------- | ------- | --------- | ------- |
| ...      | ...     | ...       | ...     |

Step 1c: Eval Criteria

Define what quality dimensions matter for this app — based on the project analysis (00-project-analysis.md) and the entry point (01-entry-point.md) you've already documented.

This document serves two purposes:

1. Dataset creation (Step 4): The use cases tell you what kinds of items to generate. Each use case should have representative items in the dataset.
2. Evaluator selection (Step 3): The eval criteria tell you what evaluators to choose and how to map them.

Derive use cases from the capability inventory in pixie_qa/00-project-analysis.md. Derive eval criteria from the hard problems / failure modes — not generic quality dimensions like "factuality" or "relevance".

Keep this concise — it's a planning artifact, not a comprehensive spec.

---

What to define

1. Use cases

List the distinct scenarios the app handles. Derive these from the capability inventory in pixie_qa/00-project-analysis.md — each capability should map to at least one use case. Each use case becomes a category of dataset items. Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is. The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.

When possible, indicate the expected difficulty level for each use case — e.g., "routine" for straightforward cases, "challenging" for edge cases or failure-mode scenarios. This guides dataset creation (Step 4) to include entries across a range of difficulty levels rather than clustering at easy cases.

Good use case descriptions:

"Reroute to human agent on account lookup difficulties"
"Answer billing question using customer's plan details from CRM"
"Decline to answer questions outside the support domain"
"Summarize research findings including all queried sub-topics"

Bad use case descriptions (too vague):

"Handle billing questions"
"Edge case"
"Error handling"

2. Eval criteria

Define high-level, application-specific eval criteria — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 3.

Good criteria are specific to the app's purpose and derived from the hard problems / failure modes in pixie_qa/00-project-analysis.md. Examples:

Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"
Web scraper: "Does the extracted data match the requested schema fields?", "Does it handle malformed HTML without crashing or losing data?"

Bad criteria are generic evaluator names dressed up as requirements. Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app. If your criteria could apply to any chatbot (e.g., "Groundedness", "PromptRelevance"), they're too generic — go back to the failure modes in 00-project-analysis.md and derive criteria from those.

At this stage, don't pick evaluator classes or thresholds. That comes in Step 3.

3. Check criteria applicability and observability

For each criterion:

1. Determine applicability scope — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 4 (dataset creation) because:

Universal criteria → become dataset-level default evaluators
Case-specific criteria → become item-level evaluators on relevant rows only

2. Verify observability — for each criterion, identify what data point in the app needs to be captured as a wrap() call to evaluate it. This drives the wrap coverage in Step 2.

If the criterion is about the app's final response → captured by wrap(purpose="output", name="response")
If it's about a routing decision → captured by wrap(purpose="state", name="routing_decision")
If it's about data the app fetched and used → captured by wrap(purpose="input", name="...")
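
The applicability and observability checks above can be tracked as plain data before anything is written to 02-eval-criteria.md. The following is a minimal sketch only: the `Criterion` dataclass, its field names, and the `account_request` use-case label are hypothetical bookkeeping, not part of pixie-qa.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical record for one eval criterion (not a pixie-qa type)."""
    text: str
    applies_to: tuple[str, ...]  # empty tuple means "all use cases"
    wrap_name: str               # wrap() name that must be captured


criteria = [
    Criterion("Are responses concise enough for phone conversation?", (), "response"),
    Criterion(
        "Does the agent verify the caller's identity before transferring?",
        ("account_request",),
        "routing_decision",
    ),
]

# Universal criteria become dataset-level defaults; the rest attach to specific rows.
dataset_defaults = [c for c in criteria if not c.applies_to]
item_level = [c for c in criteria if c.applies_to]

# Every wrap name listed here needs a matching wrap() call in Step 2.
required_wraps = sorted({c.wrap_name for c in criteria})
```

Keeping the wrap name next to each criterion makes it easy to spot a criterion that has no observable data point yet.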

---

Projects with multiple capabilities

If the project analysis (pixie_qa/00-project-analysis.md) lists multiple capabilities, you should evaluate at minimum the 2-3 most important / commonly used capabilities. Don't limit the dataset to a single capability when the project's value comes from breadth.

For each additional capability beyond the first:

Add use cases in 02-eval-criteria.md
Plan for a separate trace (run pixie trace with different entry points / configs) in Step 2
Plan dataset entries covering that capability in Step 4

If time or context constraints make it impractical to cover all capabilities, document which ones you covered and which you skipped (with rationale) at the end of 02-eval-criteria.md.

---

Criteria quality gate (mandatory self-check)

Before writing 02-eval-criteria.md, run this check on every criterion:

For each criterion, ask: "If the app returned a structurally correct but semantically wrong or hallucinated answer, would this criterion catch it?"

If the answer is "no" for ALL criteria, your criteria set is structural-only — it checks plumbing (fields exist, data flowed through) but not quality (content is correct, complete, non-hallucinated). You must add at least one semantic criterion that evaluates the _content_ of the app's output, not just its shape.
Structural criteria (field existence, JSON validity, format checks) are useful but insufficient. They pass even when the app returns fabricated or incorrect data.

Examples of structural vs semantic criteria:

| Structural (checks shape)                   | Semantic (checks quality)                                                  |
| ------------------------------------------- | -------------------------------------------------------------------------- |
| "Required fields are present in the output" | "Extracted values match the source content, with no hallucinated data"     |
| "Source type matches expected type"         | "The app correctly interpreted noisy input without losing key facts"       |
| "Output is valid JSON"                      | "The summary accurately captures the main points of the document"          |
| "Response contains at least N characters"   | "The response addresses the user's specific question, not a generic topic" |

A good criteria set has both structural and semantic criteria. Structural criteria catch gross failures (app crashed, returned empty output). Semantic criteria catch quality failures (app ran but returned wrong/hallucinated/incomplete content).
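
To make the gap concrete, here is a small sketch in which a structural check passes on a hallucinated value. The helper names and the sample data are invented for illustration, and the substring comparison is only a crude stand-in for the judgment a semantic criterion requires.

```python
import json

SOURCE_TEXT = "Acme Widget costs $19.99 and ships in 3 days."

# A structurally valid output whose price was hallucinated.
app_output = json.dumps({"product": "Acme Widget", "price": "$24.99"})


def required_fields_present(raw: str, fields: list[str]) -> bool:
    """Structural check: valid JSON object containing every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in fields)


def values_found_in_source(raw: str, source: str) -> bool:
    """Crude grounding proxy: every extracted value appears verbatim in the source."""
    data = json.loads(raw)
    return all(str(v) in source for v in data.values())


structural_ok = required_fields_present(app_output, ["product", "price"])
grounded_ok = values_found_in_source(app_output, SOURCE_TEXT)
```

Here `structural_ok` is true while `grounded_ok` is false, which is exactly the failure a structural-only criteria set would miss.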

---

Output: `pixie_qa/02-eval-criteria.md`

Write your findings to this file. Keep it short — the template below is the maximum length.

Template

# Eval Criteria

## Use cases

1. <Use case name>: <one-liner conveying input + expected behavior>
2. ...

## Eval criteria

| #   | Criterion | Applies to    | Data to capture |
| --- | --------- | ------------- | --------------- |
| 1   | ...       | All           | wrap name: ...  |
| 2   | ...       | Use case 1, 3 | wrap name: ...  |

## Capability coverage

Capabilities covered: <list>
Capabilities skipped (with rationale): <list or "none">

Step 2a: Instrument with `wrap`

For the full wrap() API reference, see wrap-api.md.

Goal: Add wrap() calls at data boundaries so the eval harness can (1) inject controlled inputs in place of real external dependencies, and (2) capture outputs for scoring.

---

Data-flow analysis

Starting from LLM call sites, trace backwards and forwards through the code to find:

Dependency input: data from external systems (databases, APIs, caches, file systems, network fetches)
App output: data going out to users or external systems
Intermediate state: internal decisions relevant to evaluation (routing, tool calls)

You do not need to wrap LLM call arguments or responses — those are already captured by OpenInference auto-instrumentation.

Adding `wrap()` calls

For each data point found, add a wrap() call in the application code:

import pixie

# External dependency data — function form (prevents the real call in eval mode)
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile",
    description="Customer profile fetched from database")(user_id)

# External dependency data — function form (prevents the real call in eval mode)
history = pixie.wrap(redis.get_history, purpose="input", name="conversation_history",
    description="Conversation history from Redis")(session_id)

# App output — what the user receives
response = pixie.wrap(response_text, purpose="output", name="response",
    description="The assistant's response to the user")

# Intermediate state — internal decision relevant to evaluation
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
    description="Which agent was selected to handle this request")

Value vs. function wrapping

# Value form: wrap a data value (result already computed)
profile = pixie.wrap(db.get_profile(user_id), purpose="input", name="customer_profile")

# Function form: wrap the callable — in eval mode the original function is
# NOT called; the registry value is returned instead.
profile = pixie.wrap(db.get_profile, purpose="input", name="customer_profile")(user_id)

CRITICAL: Always use function form for `purpose="input"` wraps on external calls — HTTP requests, database queries, API calls, file reads, cache lookups. Function form prevents the real call from executing in eval mode, so the dataset value is returned directly without making a live network request or database query. Value form still executes the real call first and only replaces the result afterwards — this wastes time, creates flaky tests, and makes evals dependent on external service availability.

The only case where value form is acceptable for purpose="input" is when the wrapped value is a local computation (no I/O, no side effects) that is cheap to recompute.
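
The difference between the two forms can be seen with a toy model. The sketch below is NOT pixie's implementation: `toy_wrap`, `EVAL_MODE`, and `EVAL_REGISTRY` are hypothetical names that only imitate the behavior described above for `purpose="input"`.

```python
from typing import Any

EVAL_REGISTRY: dict[str, Any] = {}  # hypothetical: wrap name -> injected dataset value
EVAL_MODE = False                   # hypothetical switch; pixie manages this itself


def toy_wrap(target: Any, *, purpose: str, name: str) -> Any:
    """Toy stand-in for wrap() covering purpose="input" only."""
    injected = EVAL_MODE and purpose == "input" and name in EVAL_REGISTRY
    if callable(target):
        # Function form: hand back a callable that skips the real call when injected.
        def call(*args: Any, **kwargs: Any) -> Any:
            return EVAL_REGISTRY[name] if injected else target(*args, **kwargs)
        return call
    # Value form: the caller already computed `target`; only the result can be swapped.
    return EVAL_REGISTRY[name] if injected else target


calls: list[str] = []


def get_profile(user_id: str) -> dict:
    calls.append(user_id)  # stands in for a live database query
    return {"id": user_id, "plan": "live"}


EVAL_MODE = True
EVAL_REGISTRY["customer_profile"] = {"id": "u1", "plan": "from_dataset"}

value_form = toy_wrap(get_profile("u1"), purpose="input", name="customer_profile")
calls_after_value_form = len(calls)      # the real call already ran

function_form = toy_wrap(get_profile, purpose="input", name="customer_profile")("u1")
calls_after_function_form = len(calls)   # no additional real call was made
```

Both forms return the dataset value, but only the value form paid for a real `get_profile` call along the way.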

Placement rules

1. Wrap at the data boundary, where data enters or exits the application, not deep inside utility functions.
2. Names must be unique across the entire application (used as registry keys and dataset field names).
3. Use `lower_snake_case` for names.
4. Don't change the function's interface: wrap() is purely additive and returns the same type.

Placement by purpose

`purpose="input"` — where external data enters

Place input wraps at the boundary where external data enters the app, not at intermediate processing stages. In a pipeline architecture (fetch → process → extract → format):

Correct: wrap(fetch_page, purpose="input", name="fetched_page")(url) using function form at the HTTP fetch boundary — in eval mode, the fetch is skipped entirely and the dataset value is returned; in trace mode, the real fetch runs and the result is captured.
Incorrect: wrap(html_content, purpose="input", name="fetched_page") using value form — the HTTP fetch still runs in eval mode (wasting time and creating flaky tests), and only the result is replaced afterwards.
Incorrect: wrap(processed_chunks, purpose="input", name="chunks") after parsing — eval mode bypasses parsing and chunking entirely.

Principle: wrap(purpose="input") replaces the _minimum external dependency_ while exercising the _maximum internal logic_. Push the boundary as far upstream as possible. Always use function form for input wraps on external calls — this prevents the real call from executing in eval mode.
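
As a rough illustration of why the boundary belongs upstream, the sketch below feeds dataset HTML into a toy pipeline at the fetch boundary. The `parse`, `chunk`, and `run_pipeline` functions are hypothetical stand-ins for an app's internal stages and involve no pixie calls.

```python
def parse(html: str) -> str:
    """Internal logic: strip a trivial paragraph wrapper."""
    return html.replace("<p>", "").replace("</p>", "\n").strip()


def chunk(text: str) -> list[str]:
    """Internal logic: one chunk per non-empty line."""
    return [line for line in text.splitlines() if line]


executed: list[str] = []


def run_pipeline(html: str) -> list[str]:
    """Run the stages that sit downstream of the fetch boundary."""
    executed.append("parse")
    text = parse(html)
    executed.append("chunk")
    return chunk(text)


# Injection at the fetch boundary: the dataset supplies raw HTML,
# so parsing and chunking are both exercised by the eval run.
dataset_html = "<p>Plan: Pro</p><p>Renewal: monthly</p>"
chunks = run_pipeline(dataset_html)
```

Had the dataset supplied `chunks` directly, neither `parse` nor `chunk` would have run, and bugs in either stage would stay invisible to the evaluators.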

`purpose="output"` — where processed data exits

Track downstream from the LLM response to find where data leaves the app — sent to the user, written to storage, rendered in UI, or passed to an external system. Wrap at that exit boundary.

Don't wrap raw LLM responses — those are already captured by OpenInference auto-instrumentation as llm_span entries.
Wrap the app's final processed result — after any post-processing, formatting, or transformation the app applies to the LLM output.
If the app has multiple output channels (e.g., a response to the user AND a side-effect write to a database), wrap each one separately.

# Final response after the app's formatting pipeline
response = pixie.wrap(formatted_response, purpose="output", name="response",
    description="Final response sent to the user")

# Side-effect output — data written to external storage
pixie.wrap(saved_record, purpose="output", name="saved_summary",
    description="Summary record saved to the database")

Principle: output wraps are observation-only — they capture what the app produced so evaluators can score it. They are never mocked or injected during eval runs.

`purpose="state"` — internal decisions relevant to evaluation

Some eval criteria need to judge the app's internal reasoning — not just what went in or came out, but _how_ the app made decisions. Wrap internal state when an eval criterion requires it and the data isn't visible in inputs or outputs.

Common examples:

Agent routing: which sub-agent or tool was selected to handle a request
Plan/step decisions: what steps the agent chose to execute
Memory updates: what the agent added to or removed from its working memory
Retrieval results: which documents/chunks were retrieved before being fed to the LLM

# Agent routing decision
selected_agent = pixie.wrap(selected_agent, purpose="state", name="routing_decision",
    description="Which agent was selected to handle this request")

# Retrieved context fed to LLM
pixie.wrap(retrieved_chunks, purpose="state", name="retrieved_context",
    description="Document chunks retrieved by RAG before LLM call")

Principle: only wrap state that an eval criterion actually needs. Don't wrap every variable — state wraps are for internal data that evaluators must see but that doesn't appear in the app's inputs or outputs.

Coverage check

After adding all wrap() calls, go through each eval criterion from pixie_qa/02-eval-criteria.md and verify:

1. Every criterion that judges what went in has a corresponding input or entry wrap.
2. Every criterion that judges what came out has a corresponding output wrap.
3. Every criterion that judges how the app decided has a corresponding state wrap.

If a criterion needs data that isn't captured, add the wrap now — don't defer.

---

Output

Modified application source files with wrap() calls at data boundaries.

Step 2b: Implement the Runnable

For the full Runnable protocol and wrap() API, see wrap-api.md.

Goal: Write a Runnable class that lets the eval harness invoke the application exactly as a real user would.

---

The core idea

The Runnable is how pixie test and pixie trace run your application. Think of it as a programmatic stand-in for a real user: it starts the app, sends it a request, and lets the app do its thing. The eval harness calls run() for each test case, passing in the user's input parameters. The app processes those parameters through its real code — real routing, real prompt assembly, real LLM calls, real response formatting — and the harness observes what happens via the wrap() instrumentation from Step 2a.

This means the Runnable should be simple. It just wires up the app's real entry point to the harness interface. If your Runnable is getting complicated — if you're building custom logic, reimplementing app behavior, or replacing components — something is wrong.

Four requirements

1. Run the real production code

The Runnable calls the app's actual entry point — the same function, class, or endpoint a real user would trigger. It does not reimplement, shortcut, or substitute any part of the application.

This includes the LLM. The app's LLM calls must go through the real code path — do not mock, fake, or replace application components. The whole point of eval-based testing is that LLM outputs are non-deterministic, so you use evaluators (not assertions) to score them. If you replace any component with a fake, you've eliminated the real behavior and the eval measures nothing.

If the app won't run due to missing environment variables or configuration that you cannot resolve, stop and ask the user to fix the environment setup. Do not work around it by mocking components.

2. Represent start-up args with a Pydantic BaseModel

The run() method receives a Pydantic BaseModel whose fields are populated from the dataset's input_data. Define a subclass with the fields the app needs:

from pydantic import BaseModel

class AppArgs(BaseModel):
    user_message: str
    # Add more fields as the app's entry point requires.
    # These map 1:1 to the dataset input_data keys.

The fields must reflect what a real user actually provides. Read pixie_qa/00-project-analysis.md — the "Realistic input characteristics" section describes the complexity, scale, and variety of real inputs. Design the model to accept inputs at that level of realism, not simplified toy versions.

Understand the boundary between user-provided parameters and world data:

User-provided parameters (fields on the BaseModel): what a real user types or configures — prompts, queries, configuration flags, URLs, schema definitions.
World data (handled by wrap(purpose="input") in Step 2a): content the app fetches from external sources during execution — web pages, database records, API responses. This is NOT part of the BaseModel.

| App type             | BaseModel fields (user provides)       | World data (wrap provides)                                         |
| -------------------- | -------------------------------------- | ------------------------------------------------------------------ |
| Web scraper          | URL + prompt + schema definition       | The HTML page content                                              |
| Research agent       | Research question + scope constraints  | Source documents, search results                                   |
| Customer support bot | Customer's spoken message              | Customer profile from CRM, conversation history from session store |
| Code review tool     | PR URL + review criteria               | The actual diff, file contents, CI results                         |

If a field ends up holding data the app would normally fetch itself, it probably belongs in a wrap(purpose="input") call instead of on the BaseModel.

3. Be concurrency-safe

run() is called concurrently for multiple dataset entries (up to 4 in parallel). If the app uses shared mutable state — SQLite, file-based DBs, global caches — protect access with asyncio.Semaphore:

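import asyncio

import pixie

class AppRunnable(pixie.Runnable[AppArgs]):
    _sem: asyncio.Semaphore

    @classmethod
    def create(cls) -> "AppRunnable":
        inst = cls()
        # Semaphore(1) serializes access to the shared resource across concurrent run() calls.
        inst._sem = asyncio.Semaphore(1)
        return inst

    async def run(self, args: AppArgs) -> None:
        async with self._sem:
            # call_app is a placeholder for the app's real entry point.
            await call_app(args.user_message)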

Only add the semaphore when the app actually has shared mutable state. If the app uses per-request state (keyed by unique IDs) or is inherently stateless, concurrent calls are naturally isolated.

4. Adhere to the Runnable interface

class AppRunnable(pixie.Runnable[AppArgs]):
    @classmethod
    def create(cls) -> "AppRunnable": ...     # construct instance
    async def setup(self) -> None: ...        # once, before first run()
    async def run(self, args: AppArgs) -> None: ...  # per dataset entry, concurrent
    async def teardown(self) -> None: ...     # once, after last run()

create() — class method, returns a new instance. Use a quoted return type (-> "AppRunnable") to avoid forward reference errors.
setup() — optional async; initialize shared resources (HTTP clients, DB connections, servers).
run(args) — async; called per dataset entry. Invoke the app's real entry point here.
teardown() — optional async; clean up resources from setup().
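
The call order described above can be sketched with a toy driver. This is an illustration under stated assumptions, not pixie's harness: `DemoRunnable` does not subclass `pixie.Runnable`, `drive` is a hypothetical function, and the limit of 4 simply mirrors the parallelism mentioned earlier.

```python
import asyncio


class DemoRunnable:
    """Hypothetical Runnable-shaped class that records lifecycle events."""

    def __init__(self) -> None:
        self.events: list[str] = []

    @classmethod
    def create(cls) -> "DemoRunnable":
        return cls()

    async def setup(self) -> None:
        self.events.append("setup")

    async def run(self, args: dict) -> None:
        await asyncio.sleep(0)  # give other run() calls a chance to interleave
        self.events.append(f"run:{args['user_message']}")

    async def teardown(self) -> None:
        self.events.append("teardown")


async def drive(entries: list[dict], max_parallel: int = 4) -> list[str]:
    """Toy driver: create, setup once, run concurrently, teardown once."""
    runnable = DemoRunnable.create()
    await runnable.setup()
    gate = asyncio.Semaphore(max_parallel)

    async def one(entry: dict) -> None:
        async with gate:
            await runnable.run(entry)

    try:
        await asyncio.gather(*(one(e) for e in entries))
    finally:
        await runnable.teardown()
    return runnable.events


events = asyncio.run(drive([{"user_message": m} for m in ("a", "b", "c")]))
```

Because run() calls may interleave, anything initialized in setup() has to tolerate concurrent use, which is the reason for requirement 3.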

Minimal example

# pixie_qa/run_app.py
from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    user_message: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Drives the application for tracing and evaluation."""

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def run(self, args: AppArgs) -> None:
        from myapp import handle_request
        await handle_request(args.user_message)

That's it. The Runnable imports the app's real entry point and calls it. No custom logic, no component replacement, no clever workarounds.

Architecture-specific examples

Based on how the application runs, read the corresponding example file:

| App type                        | Entry point             | Example file                                               |
| ------------------------------- | ----------------------- | ---------------------------------------------------------- |
| Standalone function (no server) | Python function         | Read `references/runnable-examples/standalone-function.md` |
| Web server (FastAPI, Flask)     | HTTP/WebSocket endpoint | Read `references/runnable-examples/fastapi-web-server.md`  |
| CLI application                 | Command-line invocation | Read `references/runnable-examples/cli-app.md`             |

Read only the example file that matches your app type.

File placement

Place the file at pixie_qa/run_app.py.
The dataset's "runnable" field references: "pixie_qa/run_app.py:AppRunnable".
The project root is automatically on sys.path, so use normal imports (from app import service).

Technical note

Do NOT use from __future__ import annotations in runnable files — it breaks Pydantic's model resolution for nested models. Use quoted return types where needed instead.

---

Output

pixie_qa/run_app.py — the Runnable class.

Step 2c: Capture and verify a reference trace

Goal: Run the app through the Runnable, capture a trace, and verify that instrumentation and the Runnable are working correctly. The trace proves everything is wired up and provides the exact data shapes needed for dataset creation in Step 4.

---

Choose the trace input

The trace input determines what code paths are captured. A trivial input produces a trivial trace that misses the app's real behavior.

The input must reflect the "Realistic input characteristics" section of pixie_qa/00-project-analysis.md, which you read in Step 2b.

The input has two parts — understand the boundary between them:

User-provided parameters (you author): What a real user types or configures — prompts, queries, configuration flags, URLs, schema definitions. Write these to be representative of real usage.
World data (captured from production code, not fabricated): Content the app fetches from external sources during execution — database records, API responses, files, etc. Run the production code once to capture this data into the trace. Only resort to synthetic data generation when:
The user explicitly instructs you to use synthetic data, OR
Fetching from real sources is impractical (too many fetches, incurs real monetary cost, or takes unreasonably long — more than ~30 minutes)

Quick check before writing input: "Would a real user create this data, or would the app get it from somewhere else?" If the app gets it, let the production code run and capture it.

| App type             | User provides (you author)             | World provides (you source)                                        |
| -------------------- | -------------------------------------- | ------------------------------------------------------------------ |
| Web scraper          | URL + prompt + schema definition       | The HTML page content                                              |
| Research agent       | Research question + scope constraints  | Source documents, search results                                   |
| Customer support bot | Customer's spoken message              | Customer profile from CRM, conversation history from session store |
| Code review tool     | PR URL + review criteria               | The actual diff, file contents, CI results                         |

Capture multiple traces

Capture at least 2 traces with different input characteristics before building the dataset:

Different complexity (simple case vs. complex case)
Different capabilities (see 00-project-analysis.md capability inventory)
Different edge conditions (missing optional data, unusually large input)

This calibration prevents dataset homogeneity — you see what the app actually does with varied inputs.

---

Run `pixie trace`

First, verify the app can be imported: python -c "from <module> import <class>". Catch missing packages before entering a trace-install-retry loop.

# Create a JSON file with input data
echo '{"user_message": "a realistic sample input"}' > pixie_qa/sample-input.json

uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable \
  --input pixie_qa/sample-input.json \
  --output pixie_qa/reference-trace.jsonl

The --input flag takes a file path to a JSON file (not inline JSON). The JSON keys become kwargs for the Pydantic model.
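
Since the JSON keys have to line up with the model's fields, a quick pre-flight check can save a failed trace run. The sketch below uses only the standard library: `AppArgsShape` is a hypothetical dataclass stand-in for the Pydantic `AppArgs` model, and `check_input_file` is an invented helper, not a pixie command.

```python
import json
import tempfile
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class AppArgsShape:
    """Stdlib stand-in for the Pydantic AppArgs model, used only to compare keys."""
    user_message: str


def check_input_file(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the keys line up."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    expected = {f.name for f in fields(AppArgsShape)}
    provided = set(payload)
    problems = [f"missing key: {k}" for k in sorted(expected - provided)]
    problems += [f"unexpected key: {k}" for k in sorted(provided - expected)]
    return problems


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "sample-input.json"
    good.write_text(json.dumps({"user_message": "a realistic sample input"}), encoding="utf-8")
    bad = Path(tmp) / "bad-input.json"
    bad.write_text(json.dumps({"message": "wrong key"}), encoding="utf-8")
    good_problems = check_input_file(good)
    bad_problems = check_input_file(bad)
```

In a real project the same comparison could be done against the actual Pydantic model's fields instead of a dataclass copy.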

For additional traces:

uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable \
  --input pixie_qa/sample-input-complex.json \
  --output pixie_qa/trace-complex.jsonl

---

Verify the trace

Quick inspection

The trace JSONL contains one line per wrap() event and one line per LLM span:

{"type": "kwargs", "value": {"user_message": "What are your hours?"}}
{"type": "wrap", "name": "customer_profile", "purpose": "input", "data": {...}, ...}
{"type": "llm_span", "request_model": "gpt-4o", "input_messages": [...], ...}
{"type": "wrap", "name": "response", "purpose": "output", "data": "Our hours are...", ...}

Check that:

Expected wrap entries appear (one per wrap() call in the code)
At least one llm_span entry appears (confirms real LLM calls were made)
Missing entries indicate the execution path was different than expected — fix before continuing
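
These checks are easy to script. The sketch below relies only on the `type`, `name`, and `purpose` keys visible in the excerpt above; the sample lines are hypothetical, with toy values in place of the elided fields, and `summarize_trace` is an invented helper.

```python
import json


def summarize_trace(lines: list[str]) -> dict:
    """Count llm_span lines and collect wrap names by purpose from trace JSONL lines."""
    llm_spans = 0
    wraps: dict[str, list[str]] = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") == "llm_span":
            llm_spans += 1
        elif record.get("type") == "wrap":
            wraps.setdefault(record["purpose"], []).append(record["name"])
    return {"llm_spans": llm_spans, "wraps": wraps}


# Hypothetical trace lines shaped like the excerpt above.
sample = [
    json.dumps({"type": "kwargs", "value": {"user_message": "What are your hours?"}}),
    json.dumps({"type": "wrap", "name": "customer_profile", "purpose": "input",
                "data": {"plan": "pro"}}),
    json.dumps({"type": "llm_span", "request_model": "gpt-4o", "input_messages": []}),
    json.dumps({"type": "wrap", "name": "response", "purpose": "output",
                "data": "Our hours are..."}),
]

expected_wraps = {"customer_profile", "response"}  # one per wrap() call in the code
summary = summarize_trace(sample)
seen = {name for names in summary["wraps"].values() for name in names}
missing = sorted(expected_wraps - seen)
```

For a real trace, read the lines from pixie_qa/reference-trace.jsonl and list the wrap names you added in Step 2a as `expected_wraps`.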

Format and verify coverage

Run pixie format to see the data in dataset-entry format:

pixie format --input pixie_qa/reference-trace.jsonl --output dataset_entry.json

The output shows:

input_data: the exact keys/values for runnable arguments
eval_input: data from wrap(purpose="input") calls
eval_output: the actual app output (from wrap(purpose="output"))

For each eval criterion from pixie_qa/02-eval-criteria.md, verify the format output contains the data needed. If a data point is missing, go back to Step 2a and add the wrap() call.

Trace audit

Before proceeding to Step 3, audit every trace:

1. World data check: For each wrap(purpose="input") field, is the data realistically complex? Compare against 00-project-analysis.md "Realistic input characteristics." If the analysis says inputs are 5KB–500KB and yours is under 5KB, it's not representative.

2. LLM span check: Do llm_span entries appear? If not, the app's LLM calls didn't fire — the Runnable may be misconfigured or the LLM may be mocked/faked. Fix this before continuing.

3. Complexity check: Does the trace exercise the hard problems from 00-project-analysis.md? If it only exercises the happy path, capture an additional trace with harder inputs.

If any check fails, go back and fix the input or Runnable, then re-capture.
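
The world data check can be approximated by measuring the serialized size of each input wrap. This is a sketch under assumptions: it relies on the `data` key shown in the trace excerpt, the trace lines are made up, and `MIN_BYTES` is a hypothetical floor taken from the "5KB to 500KB" example rather than a pixie setting.

```python
import json


def input_wrap_sizes(lines: list[str]) -> dict[str, int]:
    """Serialized size in bytes of the data field of each purpose="input" wrap."""
    sizes: dict[str, int] = {}
    for line in lines:
        record = json.loads(line)
        if record.get("type") == "wrap" and record.get("purpose") == "input":
            sizes[record["name"]] = len(json.dumps(record.get("data")).encode("utf-8"))
    return sizes


MIN_BYTES = 5 * 1024  # hypothetical floor from the project analysis

# Hypothetical trace lines: one tiny page and one of realistic size.
trace_lines = [
    json.dumps({"type": "wrap", "name": "fetched_page", "purpose": "input",
                "data": "<html>tiny</html>"}),
    json.dumps({"type": "wrap", "name": "big_page", "purpose": "input",
                "data": "x" * 6000}),
]

sizes = input_wrap_sizes(trace_lines)
too_small = sorted(name for name, size in sizes.items() if size < MIN_BYTES)
```

Size is only a proxy for realism, so a wrap that passes this threshold still deserves a manual look at its content.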

---

Output

pixie_qa/reference-trace.jsonl — reference trace with all expected wrap events and LLM spans
Additional trace files for varied inputs

Step 3: Define Evaluators

Why this step: With the app instrumented (Step 2), you now map each eval criterion to a concrete evaluator — implementing custom ones where needed — so the dataset (Step 4) can reference them by name.

---

3a. Map criteria to evaluators

Every eval criterion from Step 1c — including any dimensions specified by the user in the prompt — must have a corresponding evaluator. If the user asked for "factuality, completeness, and bias," you need three evaluators (or a multi-criteria evaluator that covers all three). Do not silently drop any requested dimension. Prioritize evaluators that measure the hard problems / failure modes identified in pixie_qa/00-project-analysis.md — these are more valuable than generic quality evaluators.

For each eval criterion, choose an evaluator using this decision order:

1. Built-in evaluator: if a standard evaluator fits the criterion (factual correctness → Factuality, exact match → ExactMatch, RAG faithfulness → Faithfulness). See evaluators.md for the full catalog.
2. Agent evaluator (create_agent_evaluator): the default for all semantic, qualitative, and app-specific criteria. Agent evaluators are graded by you (the coding agent) in Step 6, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?"
3. Manual custom evaluator: ONLY for mechanical, deterministic checks where a programmatic function is definitively correct, such as field existence, regex pattern matching, JSON schema validation, numeric thresholds, and type checking. Never use manual custom evaluators for semantic quality. If the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead.

Distinguish structural from semantic criteria: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural.

For open-ended LLM text, never use ExactMatch — LLM outputs are non-deterministic.

AnswerRelevancy is RAG-only: it requires a context value in the trace and returns 0.0 without one. For general relevance, use an agent evaluator with clear criteria.

3b. Implement custom evaluators

If any criterion requires a custom evaluator, implement it now. Place custom evaluators in pixie_qa/evaluators.py (or a sub-module if there are many).

Agent evaluators (`create_agent_evaluator`) — the default

Use agent evaluators for all semantic, qualitative, and judgment-based criteria. These are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling.

from pixie import create_agent_evaluator

extraction_accuracy = create_agent_evaluator(
    name="ExtractionAccuracy",
    criteria="The extracted data accurately reflects the source content. All fields "
             "contain correct values from the source — no hallucinated, fabricated, or "
             "placeholder values. Compare the final_answer against the fetched_content "
             "and parsed_content to verify every claimed fact.",
)

noise_handling = create_agent_evaluator(
    name="NoiseHandling",
    criteria="The app correctly ignored navigation chrome, boilerplate, ads, and other "
             "non-content elements from the source. The extracted data contains only "
             "information relevant to the user's prompt, not noise from the page structure.",
)

schema_compliance = create_agent_evaluator(
    name="SchemaCompliance",
    criteria="The output contains all fields requested in the prompt with appropriate "
             "types and non-trivial values. Missing fields, null values for required data, "
             "or fields with generic placeholder text indicate failure.",
)

Reference agent evaluators in the dataset via filepath:callable_name (e.g., "pixie_qa/evaluators.py:extraction_accuracy").

During pixie test, agent evaluators show as ⏳ in the console. They are graded in Step 5d.

Writing effective criteria: The criteria string is the grading rubric you'll follow in Step 5d. Make it specific and actionable:

Bad: "Check if the output is good" — too vague to grade consistently
Bad: "The response should be accurate" — doesn't say what to compare against
Good: "Compare the extracted fields against the source HTML/document. Each field must have a corresponding passage in the source. Flag any field whose value cannot be traced back to the source content."
Good: "The app should preserve the structural hierarchy of the source document. If the source has sections/subsections, the extraction should reflect that nesting, not flatten everything into a single level."

Manual custom evaluator — for mechanical checks only

Use manual custom evaluators only for deterministic, programmatic checks where a simple function definitively gives the right answer. Examples: field existence, regex matching, JSON schema validation, numeric range checks, type verification.

Do NOT use manual custom evaluators for semantic quality. If the check requires _judgment_ about whether content is correct, relevant, complete, or well-written, use an agent evaluator instead. The litmus test: "Could a regex, string match, or comparison operator implement this check perfectly?" If not, it's semantic — use an agent evaluator.

Custom evaluators can be sync or async functions. Assign them to module-level variables in pixie_qa/evaluators.py:

from pixie import Evaluation, Evaluable

def my_evaluator(evaluable: Evaluable, *, trace=None) -> Evaluation:
    score = 1.0 if "expected pattern" in str(evaluable.eval_output) else 0.0
    return Evaluation(score=score, reasoning="...")

Reference by filepath:callable_name in the dataset: "pixie_qa/evaluators.py:my_evaluator".

Accessing `eval_metadata` and captured data: Custom evaluators access per-entry metadata and wrap() outputs via the Evaluable fields:

evaluable.eval_metadata — dict from the entry's eval_metadata field (e.g., {"expected_tool": "endCall"})
evaluable.eval_output — list[NamedData] containing ALL wrap(purpose="output") and wrap(purpose="state") values. Each item has .name (str) and .value (JsonValue). Use the helper below to look up by name.

from typing import Any

from pixie import Evaluation, Evaluable

def _get_output(evaluable: Evaluable, name: str) -> Any:
    """Look up a wrap value by name from eval_output."""
    for item in evaluable.eval_output:
        if item.name == name:
            return item.value
    return None

def call_ended_check(evaluable: Evaluable, *, trace=None) -> Evaluation:
    expected = evaluable.eval_metadata.get("expected_call_ended") if evaluable.eval_metadata else None
    actual = _get_output(evaluable, "call_ended")
    if expected is None:
        return Evaluation(score=1.0, reasoning="No expected_call_ended in eval_metadata")
    match = bool(actual) == bool(expected)
    return Evaluation(
        score=1.0 if match else 0.0,
        reasoning=f"Expected call_ended={expected}, got {actual}",
    )

ValidJSON and string expectations conflict

ValidJSON treats the dataset entry's expectation field as a JSON Schema when present. If your entries use string expectations (e.g., for Factuality), adding ValidJSON as a dataset-level default evaluator will cause failures — it cannot validate a plain string as a JSON Schema. Either apply ValidJSON only to entries with object/boolean expectations, or omit it when the dataset relies on string expectations.
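
One way to apply that rule is to split entries by expectation type before assigning evaluators. The sketch below is illustrative only: the entry dictionaries, their `id` key, and the assumption that an absent expectation is safe for ValidJSON are hypothetical and should be checked against the dataset format you actually use.

```python
# Hypothetical dataset entries; only the "expectation" field is taken from the text above.
entries = [
    {"id": "billing-1", "expectation": "The answer cites the customer's plan price."},
    {"id": "extract-1", "expectation": {"type": "object", "required": ["title"]}},
    {"id": "extract-2", "expectation": True},
    {"id": "faq-1"},
]


def valid_json_safe(entry: dict) -> bool:
    """True unless the expectation is a plain string, which is not a JSON Schema."""
    return not isinstance(entry.get("expectation"), str)


safe_ids = [e["id"] for e in entries if valid_json_safe(e)]
string_ids = [e["id"] for e in entries if not valid_json_safe(e)]
```

If `string_ids` is non-empty, ValidJSON belongs on the individual safe entries rather than in the dataset-level defaults.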

3c. Produce the evaluator mapping artifact

Write the criterion-to-evaluator mapping to pixie_qa/03-evaluator-mapping.md. This artifact bridges between the eval criteria (Step 1c) and the dataset (Step 4).

CRITICAL: Use the exact evaluator names as they appear in the evaluators.md reference — built-in evaluators use their short name (e.g., Factuality, ClosedQA), and custom evaluators use filepath:callable_name format (e.g., pixie_qa/evaluators.py:ConciseVoiceStyle).

Template

# Evaluator Mapping

## Built-in evaluators used

| Evaluator name | Criterion it covers | Applies to                 |
| -------------- | ------------------- | -------------------------- |
| Factuality     | Factual accuracy    | All items                  |
| ClosedQA       | Answer correctness  | Items with expected_output |

## Agent evaluators

| Evaluator name                             | Criterion it covers          | Applies to | Source file            |
| ------------------------------------------ | ---------------------------- | ---------- | ---------------------- |
| pixie_qa/evaluators.py:extraction_accuracy | Content accuracy vs source   | All items  | pixie_qa/evaluators.py |
| pixie_qa/evaluators.py:noise_handling      | Navigation/boilerplate noise | All items  | pixie_qa/evaluators.py |

## Manual custom evaluators (mechanical checks only)

| Evaluator name                                 | Criterion it covers  | Applies to | Source file            |
| ---------------------------------------------- | -------------------- | ---------- | ---------------------- |
| pixie_qa/evaluators.py:required_fields_present | Required field check | All items  | pixie_qa/evaluators.py |

## Applicability summary

- **Dataset-level defaults** (apply to all items): Factuality, pixie_qa/evaluators.py:extraction_accuracy
- **Item-specific** (apply to subset): ClosedQA (only items with expected_output)

Output

Custom evaluator implementations in pixie_qa/evaluators.py (if any custom evaluators needed)
pixie_qa/03-evaluator-mapping.md — the criterion-to-evaluator mapping

---

Evaluator selection guide: See evaluators.md for the full built-in evaluator catalog and create_agent_evaluator reference.

If you hit an unexpected error when implementing evaluators (import failures, API mismatch), read evaluators.md for the authoritative evaluator reference and wrap-api.md for API details before guessing at a fix.

Step 4: Build the Dataset

Why this step: The dataset ties everything together — the runnable (Step 2), the evaluators (Step 3), and the use cases (Step 1c) — into concrete test scenarios. At test time, pixie test calls the runnable with input_data, the wrap registry is populated with eval_input, and evaluators score the resulting captured outputs.

Before building entries, review:

`pixie_qa/00-project-analysis.md`: the capability inventory and failure modes. Dataset entries should cover the capabilities in the inventory, and some entries should target the listed failure modes.
`pixie_qa/02-eval-criteria.md`: use cases and their capability coverage. Ensure every listed use case has representative entries.

---

Understanding `input_data`, `eval_input`, and `expectation`

Before building the dataset, understand what these terms mean:

`input_data` = the kwargs passed to Runnable.run() as a Pydantic model. This is the app's own input (user message, request body, CLI args). The keys must match the fields of the Pydantic model defined for run(args: T).

`eval_input` = a list of {"name": ..., "value": ...} objects corresponding to wrap(purpose="input") calls in the app. At test time, these are injected automatically by the wrap registry; wrap(purpose="input") calls in the app return the registry value instead of calling the real external dependency.

eval_input may be an empty list only when the app has no wrap(purpose="input") calls. If the app HAS input wraps, every dataset entry MUST provide corresponding `eval_input` values with pre-captured content — otherwise the app makes live external calls during eval, which is slow, flaky, and non-reproducible. See section 4b′ for how to capture this content.

Each item is a NamedData object with name (str) and value (any JSON-serializable value).

`expectation` (optional) = case-specific evaluation reference. What a correct output should look like for this scenario. Used by evaluators that compare output against a reference (e.g., Factuality, ClosedQA). Not needed for output-quality evaluators that don't require a reference.

eval output = what the app actually produces, captured at runtime by wrap(purpose="output") and wrap(purpose="state") calls. Not stored in the dataset — it's produced when pixie test runs the app.

The reference trace at pixie_qa/reference-trace.jsonl is your primary source for data shapes:

Filter it to see the exact serialized format for eval_input values
Read the kwargs record to understand the input_data structure
Read purpose="output"/"state" events to understand what outputs the app produces, so you can write meaningful expectation values

---

4a. Derive evaluator assignments

The eval criteria artifact (pixie_qa/02-eval-criteria.md) maps each criterion to use cases. The evaluator mapping artifact (pixie_qa/03-evaluator-mapping.md) maps each criterion to a concrete evaluator name. Combine these:

1. Dataset-level default evaluators: Criteria marked as applying to "All" use cases → their evaluator names go in the top-level "evaluators" array.
2. Item-level evaluators: Criteria that apply to only a subset → their evaluator names go in "evaluators" on the relevant rows only, using "..." to also include the defaults.
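Combining the two artifacts can be sketched as a small function. The input shapes are assumptions made for illustration: a criterion-to-use-cases mapping in which the string "All" marks a universal criterion, and a criterion-to-evaluator mapping. The artifacts themselves are markdown, so they would need to be read by hand or parsed first.

```python
from typing import Union


def derive_assignments(
    criterion_use_cases: dict[str, Union[list[str], str]],
    criterion_evaluator: dict[str, str],
) -> tuple[list[str], dict[str, list[str]]]:
    """Split evaluators into dataset-level defaults and per-use-case extras."""
    defaults: list[str] = []
    per_use_case: dict[str, list[str]] = {}
    for criterion, use_cases in criterion_use_cases.items():
        evaluator = criterion_evaluator[criterion]
        if use_cases == "All":
            defaults.append(evaluator)
            continue
        for use_case in use_cases:
            # Start each row with "..." so the dataset-level defaults still apply.
            per_use_case.setdefault(use_case, ["..."]).append(evaluator)
    return defaults, per_use_case


# Hypothetical criteria and use-case names, for illustration only.
defaults, extras = derive_assignments(
    {"Factual accuracy": "All", "Answer correctness": ["ambiguous-request"]},
    {"Factual accuracy": "Factuality", "Answer correctness": "ClosedQA"},
)
```

Here `defaults` would go in the top-level "evaluators" array, and each list in `extras` would go on the rows that belong to that use case.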

4b. Inspect data shapes with `pixie format`

Use pixie format on the reference trace to see the exact data shapes and the real app output in dataset-entry format:

uv run pixie format --input reference-trace.jsonl --output dataset-sample.json

The output looks like:

{
  "input_data": {
    "user_message": "What are your business hours?"
  },
  "eval_input": [
    {
      "name": "customer_profile",
      "value": { "name": "Alice", "tier": "gold" }
    },
    {
      "name": "conversation_history",
      "value": [{ "role": "user", "content": "What are your hours?" }]
    }
  ],
  "expectation": null,
  "eval_output": {
    "response": "Our business hours are Monday to Friday, 9am to 5pm..."
  }
}

Important: The eval_output in this template is the full real output produced by the running app. Do NOT copy eval_output into your dataset entries — it would make tests trivially pass by giving evaluators the real answer. Instead:

Use input_data and eval_input as exact templates for data keys and format
Look at eval_output to understand what the app produces — then write a concise `expectation` description that captures the key quality criteria for each scenario

Example: if eval_output.response is "Our business hours are Monday to Friday, 9 AM to 5 PM, and Saturday 10 AM to 2 PM.", write expectation as "Should mention weekday hours (Mon–Fri 9am–5pm) and Saturday hours" — a short description a human or LLM evaluator can compare against.

4b′. Capture external content for `eval_input` (mandatory)

CRITICAL: If the app has ANY wrap(purpose="input") calls, every dataset entry MUST provide corresponding eval_input values with pre-captured real content. An empty eval_input list means the app will make live external calls (HTTP requests, database queries, API calls) during every eval run — this makes evals slow, flaky, and non-reproducible.

Why this matters

During pixie test, each wrap(purpose="input", name="X") call in the app checks the wrap registry for a value named "X":

If found: the registered value is returned directly (no external call)
If not found: the real external call executes (non-deterministic, slow, may fail)

An eval_input: [] entry means NOTHING is in the registry, so every external dependency runs live. This defeats the purpose of instrumentation.

How to capture content

For each wrap(purpose="input", name="X") in the app, you must capture the real data once and embed it in the dataset. Choose one of these approaches:

Option A — Use the reference trace (preferred):

The reference trace from Step 2c already contains captured values for every purpose="input" wrap. Extract them:

# View the reference trace to find input wrap values
grep '"purpose": "input"' pixie_qa/reference-trace.jsonl

Or use pixie format to see the data in dataset-entry format — the eval_input array in the output already has the captured values with correct names and shapes.
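For a trace with many events, the same extraction can be scripted. This is a minimal sketch that assumes each line of the trace is a JSON object with `purpose`, `name`, and `data` keys; confirm the real key names against your own reference-trace.jsonl before relying on it. The function name `extract_eval_input` is hypothetical.

```python
import json
from pathlib import Path


def extract_eval_input(trace_path: str) -> list:
    """Collect purpose="input" records as {"name": ..., "value": ...} items."""
    items = []
    for line in Path(trace_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("purpose") == "input":
            # Assumed keys: "name" identifies the wrap, "data" holds its value.
            items.append({"name": record["name"], "value": record["data"]})
    return items
```

If the key assumptions hold, calling `extract_eval_input("pixie_qa/reference-trace.jsonl")` returns a list that can be placed under an entry's eval_input.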

Option B — Fetch content directly (for new entries with different inputs):

When creating dataset entries with different input sources (e.g., different URLs, different queries), capture the content by running the dependency code once:

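# Example: for a web scraper, run the app's own fetch logic once
# (myapp.fetcher and fetch_page stand in for your app's real module and function)
from myapp.fetcher import fetch_page

target_url = "https://example.com/article"  # placeholder: the URL this entry should cover
page_content = fetch_page(target_url)  # use the app's real code path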

Then include the captured content in the entry's eval_input:

{
  "eval_input": [
    {
      "name": "fetch_result",
      "value": "<captured page content here>"
    }
  ]
}

Option C — Run `pixie trace` with each input (most thorough):

For each set of input_data, run pixie trace to execute the app with real dependencies and capture all values:

pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input trace-input.json

Then extract the purpose="input" values from the resulting trace and use them as eval_input.

Content format

The eval_input value must match the exact type and format that the wrap() call returns. Check the reference trace to see what format the app produces:

If the wrap captures a string (e.g., HTML content, markdown text), the value is a string
If the wrap captures a dict (e.g., database record), the value is a JSON object
If the wrap captures a list, the value is a JSON array

Do NOT skip this step. Every wrap(purpose="input") in the app must have a corresponding eval_input entry in every dataset row. If you proceed with empty eval_input when the app has input wraps, evals will be unreliable.
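One way to enforce this before running the suite is to compare each entry's eval_input names with the wrap names used in the app. A minimal sketch, assuming you list the wrap names by hand: `required_names` is a hypothetical parameter, and nothing here inspects the app's code.

```python
def missing_eval_inputs(dataset: dict, required_names: set) -> dict:
    """Map entry index -> wrap names that have no non-empty eval_input value."""
    gaps = {}
    for idx, entry in enumerate(dataset.get("entries", [])):
        provided = {
            item["name"]
            for item in entry.get("eval_input", [])
            # Treat null and empty containers/strings as "not captured".
            if item.get("value") not in (None, "", [], {})
        }
        missing = required_names - provided
        if missing:
            gaps[idx] = missing
    return gaps
```

An empty result means every entry supplies a value for every listed wrap name.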

4c. Generate dataset items

Create diverse entries guided by the reference trace and use cases:

`input_data` keys must match the fields of the Pydantic model used in Runnable.run(args: T)
`eval_input` must be a list of {"name": ..., "value": ...} objects matching the name values of wrap(purpose="input") calls in the app
Cover each use case from pixie_qa/02-eval-criteria.md — at least one entry per use case, with meaningfully diverse inputs across entries

If the user specified a dataset or data source in the prompt (e.g., a JSON file with research questions or conversation scenarios), read that file, adapt each entry to the input_data / eval_input shape, and incorporate them into the dataset. Do NOT ignore specified data.

Entry quality checklist

Before finalizing the dataset, verify each entry against these criteria:

Input realism:

Does eval_input contain world data that respects the synthesization boundary (see Step 2c)? User-authored parameters are fine; world data should be sourced, not fabricated from scratch.
Does the world data in eval_input match the scale and complexity described in 00-project-analysis.md "Realistic input characteristics"? If the analysis says inputs are typically 5KB–500KB, a 200-char input is not realistic.
Is the answer to the prompt non-trivial to extract from the input? A test where the answer is in a clearly labeled HTML tag or the first sentence doesn't test extraction quality.

Scenario diversity:

Do entries cover meaningfully different difficulty levels — not just different topics with the same difficulty?
Does at least one entry target a failure mode from 00-project-analysis.md that you expect might actually cause degraded scores (not a guaranteed pass)?
Do entries use different structural patterns in the input data (not just different content poured into the same template)?

Difficulty calibration:

Is there at least one entry you are genuinely uncertain whether the app will handle correctly? If you're confident every entry will pass trivially, the dataset is too easy.
Consider including one intentionally challenging entry that probes a known limitation — a "stress test" entry. If it passes, great. If it fails, the eval has demonstrated it can catch real issues.

Anti-patterns for dataset entries

Fabricating world data: Hand-authoring content the app would normally fetch from external sources (e.g., writing HTML for a web scraper, writing "retrieved documents" for a RAG system). This removes real-world complexity.
Uniform difficulty: All entries have the same complexity level. Real workloads have a distribution — some easy, some hard, some edge cases.
Obvious answers: Every entry has the target information cleanly labeled and unambiguous. Real data often has the answer scattered, partially present, duplicated with variations, or embedded in noise.
Round-trip authorship: You wrote both the input and the expected output, so you know exactly what's there. A real evaluator tests whether the app can find information it hasn't seen before.
Only happy paths: No entry tests error conditions, edge cases, or known failure modes.
Building all entries from the same toy trace with minor rephrasing: If all entries have similar input_data and similar eval_input data, the dataset tests nothing meaningful. Each entry should represent a meaningfully different scenario.
Reusing the project's own test fixtures as eval data: The project's tests/, fixtures/, examples/, and mock_server/ directories contain data designed for unit/integration tests — small, clean, deterministic, and trivially easy. Using them as eval_input data guarantees 100% pass rates and zero quality signal. Even if these fixtures look convenient, they bypass every real-world difficulty that makes the app's job hard. Run the production code to capture realistic data instead, or generate synthetic data that matches the scale/complexity from 00-project-analysis.md.
Using a project's mock/fake implementations: If the project includes mock LLMs, fake HTTP servers, or stub services in its test infrastructure, do NOT use them in your eval pipeline. Your eval must exercise the app's real code paths with realistically complex data — not the project's own test shortcuts.

4c′. Verify coverage against project analysis

Before writing the final dataset JSON, open pixie_qa/00-project-analysis.md and check:

1. Realistic input characteristics: For each characteristic listed (size, complexity, noise, variety), confirm at least one dataset entry reflects it. If the analysis says "messy inputs with navigation and ads," at least one entry's eval_input should contain messy data with navigation and ads.
2. Failure modes: For each failure mode listed, confirm at least one dataset entry is designed to exercise it. The entry doesn't need to guarantee failure, but it should create conditions where that failure mode _could_ manifest. If a failure mode cannot be exercised with the current instrumentation setup, add a note in 02-eval-criteria.md explaining why.
3. Capability coverage: Confirm the dataset covers the capabilities listed in the eval criteria (Step 1c). Each covered capability should have at least one entry.

If any gap is found, add entries to close it before proceeding to 4d.

4c″. STOP CHECK — Dataset realism audit (hard gate)

This is a hard gate. Do NOT proceed to 4d until every check passes. If any check fails, revise the dataset and re-audit.

Before writing the final dataset JSON, perform this self-audit:

1. Cross-reference `00-project-analysis.md`: Open the "Realistic input characteristics" section. For each characteristic (size, complexity, noise, structure), verify at least one dataset entry's eval_input reflects it. If the analysis says "5KB–500KB HTML pages with navigation chrome and ads" and your largest eval_input is 1KB of clean HTML, the dataset is not realistic — add harder entries.

2. Count distinct sources: How many unique eval_input data sources are in the dataset? If more than 50% of entries share the same eval_input content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing.

3. Difficulty distribution (mandatory threshold): For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode).

Maximum 60% "routine" entries. If you have 5 entries, at most 3 can be routine.
At least one "challenging" entry that targets a failure mode from 00-project-analysis.md where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one.

4. Capability coverage (mandatory threshold): Count how many capabilities from 00-project-analysis.md are exercised by at least one dataset entry.

Must cover ≥50% of listed capabilities. If the analysis lists 6 capabilities, the dataset must exercise at least 3.
If coverage is below threshold, add entries targeting the uncovered capabilities.

5. Project fixture contamination check: Scan every eval_input value. Did any data originate from the project's tests/, fixtures/, examples/, or mock server directories? If yes, replace it with real-world data. These fixtures are designed for development convenience, not evaluation realism.

6. Tautology check: Will the test pipeline produce meaningful scores, or is it a closed loop? If you authored both the input data and the evaluator logic such that passing is guaranteed by construction (e.g., regex extractor + exact-match evaluator on hand-authored HTML), the pipeline is tautological and cannot catch real issues. The app's real LLM should produce the output, and evaluators should assess quality dimensions that can genuinely fail.

7. `eval_input` completeness check: For every wrap(purpose="input", name="X") call in the instrumented app code, verify that EVERY dataset entry provides a corresponding eval_input item with "name": "X" and a non-empty "value". If any entry has eval_input: [] while the app has input wraps, the dataset is incomplete — captured content is missing. Go back to step 4b′ and capture the content.
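The numeric thresholds in checks 2 to 4 lend themselves to a quick script. The sketch below assumes you have labelled each entry by hand: the `difficulty` and `capabilities` keys are hypothetical annotations added for the audit and are not part of the pixie dataset format. Checks 1, 5, and 6 still need human judgment.

```python
import json
from collections import Counter


def audit_dataset(entries: list, listed_capabilities: set) -> list:
    """Return the failed threshold checks; an empty list means all pass."""
    total = len(entries)
    if total == 0:
        return ["dataset has no entries"]
    problems = []

    # Check 2: no single eval_input payload may back more than 50% of entries.
    payloads = Counter(
        json.dumps(e.get("eval_input", []), sort_keys=True) for e in entries
    )
    if max(payloads.values()) * 2 > total:
        problems.append("more than 50% of entries share the same eval_input")

    # Check 3: at most 60% routine, and at least one challenging entry.
    labels = Counter(e.get("difficulty") for e in entries)
    if labels["routine"] * 10 > total * 6:
        problems.append("more than 60% of entries are routine")
    if labels["challenging"] == 0:
        problems.append("no challenging entry")

    # Check 4: at least 50% of the listed capabilities are exercised.
    covered = set().union(*(set(e.get("capabilities", [])) for e in entries))
    covered &= listed_capabilities
    if len(covered) * 2 < len(listed_capabilities):
        problems.append("fewer than 50% of capabilities are covered")
    return problems
```

Integer comparisons are used for the percentages so that boundary cases such as 3 routine entries out of 5 are not affected by floating-point rounding.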

4d. Build the dataset JSON file

Create the dataset at pixie_qa/datasets/<name>.json:

{
  "name": "qa-golden-set",
  "runnable": "pixie_qa/run_app.py:AppRunnable",
  "evaluators": ["Factuality", "pixie_qa/evaluators.py:ConciseVoiceStyle"],
  "entries": [
    {
      "input_data": {
        "user_message": "What are your business hours?"
      },
      "description": "Customer asks about business hours with gold tier account",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Alice Johnson", "tier": "gold" }
        }
      ],
      "expectation": "Should mention Mon-Fri 9am-5pm and Sat 10am-2pm"
    },
    {
      "input_data": {
        "user_message": "I want to change something"
      },
      "description": "Ambiguous change request from basic tier customer",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Bob Smith", "tier": "basic" }
        }
      ],
      "expectation": "Should ask for clarification",
      "evaluators": ["...", "ClosedQA"]
    },
    {
      "input_data": {
        "user_message": "I want to end this call"
      },
      "description": "User requests call end after failed verification",
      "eval_input": [
        {
          "name": "customer_profile",
          "value": { "name": "Charlie Brown", "tier": "basic" }
        }
      ],
      "expectation": "Agent should call endCall tool and end the conversation",
      "eval_metadata": {
        "expected_tool": "endCall",
        "expected_call_ended": true
      },
      "evaluators": ["...", "pixie_qa/evaluators.py:tool_call_check"]
    }
  ]
}

Key fields

Entry structure — all fields are top-level on each entry (flat structure — no nesting):

entry:
  ├── input_data      (required) — args for Runnable.run()
  ├── eval_input      (optional) — list of {"name": ..., "value": ...} objects (default: [])
  ├── description     (required) — human-readable label for the test case
  ├── expectation     (optional) — reference for comparison-based evaluators
  ├── eval_metadata   (optional) — extra per-entry data for custom evaluators
  └── evaluators      (optional) — evaluator names for THIS entry

Top-level fields:

`runnable` (required): filepath:ClassName reference to the Runnable class from Step 2 (e.g., "pixie_qa/run_app.py:AppRunnable"). Path is relative to the project root.
`evaluators` (dataset-level, optional): Default evaluator names applied to every entry — the evaluators for criteria that apply to ALL use cases.

Per-entry fields (all top-level on each entry):

`input_data` (required): Keys match the Pydantic model fields for Runnable.run(args: T). These are the app's input data.
`eval_input` (optional, default []): List of {"name": ..., "value": ...} objects. Names match wrap(purpose="input") names in the app. The runner automatically prepends input_data when building the Evaluable.
`description` (required): Use case one-liner from pixie_qa/02-eval-criteria.md.
`expectation` (optional): Case-specific expectation text for evaluators that need a reference.
`eval_metadata` (optional): Extra per-entry data for custom evaluators — e.g., expected tool names, boolean flags, thresholds. Accessible in evaluators as evaluable.eval_metadata.
`evaluators` (optional): Row-level evaluator override.
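The required fields above can be checked mechanically before the first run. A minimal sketch over the parsed dataset JSON; the function name `check_dataset_shape` is hypothetical, and it covers only the fields marked required in this section, not everything pixie test may validate.

```python
def check_dataset_shape(dataset: dict) -> list:
    """Return human-readable problems found in the dataset structure."""
    problems = []
    # The runnable reference is documented as filepath:ClassName.
    if ":" not in str(dataset.get("runnable", "")):
        problems.append("top-level 'runnable' must be filepath:ClassName")
    for idx, entry in enumerate(dataset.get("entries", [])):
        for field in ("input_data", "description"):
            if field not in entry:
                problems.append(f"entry {idx}: missing required field '{field}'")
        for item in entry.get("eval_input", []):
            if not isinstance(item, dict) or not {"name", "value"} <= set(item):
                problems.append(f"entry {idx}: eval_input items need 'name' and 'value'")
    return problems
```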

Evaluator assignment rules

1. Evaluators that apply to ALL items go in the top-level "evaluators" array.
2. Items that need additional evaluators use "evaluators": ["...", "ExtraEval"]; "..." expands to the defaults.
3. Items that need a completely different set use "evaluators": ["OnlyThis"] without "...".
4. Items using only defaults: omit the "evaluators" field.
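The four rules can be modelled in a few lines. This sketch illustrates the rules as written here, not pixie's actual implementation, and `resolve_evaluators` is a hypothetical name.

```python
from typing import Optional


def resolve_evaluators(defaults: list, entry_evaluators: Optional[list]) -> list:
    """Apply the assignment rules to one entry and return its evaluator names."""
    if entry_evaluators is None:
        # Rule 4: field omitted, so only the dataset-level defaults apply.
        return list(defaults)
    resolved = []
    for name in entry_evaluators:
        if name == "...":
            # Rule 2: "..." expands to the dataset-level defaults.
            resolved.extend(defaults)
        else:
            # Rules 2 and 3: explicit names are used as given.
            resolved.append(name)
    return resolved
```

With defaults `["Factuality"]`, an entry declaring `["...", "ClosedQA"]` resolves to both evaluators, while `["OnlyThis"]` resolves to that single evaluator.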

---

Dataset Creation Reference

Using `eval_input` values

The eval_input values are {"name": ..., "value": ...} objects. Use the reference trace as templates — copy the "data" field from the relevant purpose="input" event and adapt the values:

Simple dict:

{ "name": "customer_profile", "value": { "name": "Alice", "tier": "gold" } }

List of dicts (e.g., conversation history):

{
  "name": "conversation_history",
  "value": [
    { "role": "user", "content": "Hello" },
    { "role": "assistant", "content": "Hi there!" }
  ]
}

Important: The exact format depends on what the wrap(purpose="input") call captures. Always copy from the reference trace rather than constructing from scratch.

Crafting diverse eval scenarios

Cover different aspects of each use case. Refer to `pixie_qa/00-project-analysis.md` for the capability inventory and failure modes:

Cover each capability — at least one entry per capability from the capability inventory, not just the primary capability
Target failure modes — include entries that exercise the hard problems / failure modes listed in the project analysis (e.g., malformed input, edge cases, complex scenarios)
Different user phrasings of the same request
Edge cases (ambiguous input, missing information, error conditions)
Entries that stress-test specific eval criteria
At least one entry per use case from Step 1c

---

Output

pixie_qa/datasets/<name>.json — the dataset file.

Step 5: Run `pixie test` and Fix Mechanical Issues

Why this step: Run pixie test and fix mechanical issues in your QA components — dataset format problems, runnable implementation bugs, and custom evaluator errors — until every entry produces real scores. This step is NOT about assessing result quality or fixing the application itself.

---

5a. Run tests

uv run pixie test

For verbose output with per-case scores and evaluator reasoning:

uv run pixie test -v

pixie test automatically loads the .env file before running tests.

The evaluation harness:

1. Resolves the Runnable class from the dataset's runnable field
2. Calls Runnable.create() to construct an instance, then setup() once
3. Runs all dataset entries concurrently (up to 4 in parallel):
   a. Reads input_data and eval_input from the entry
   b. Populates the wrap input registry with eval_input data
   c. Initialises the capture registry
   d. Validates input_data into the Pydantic model and calls Runnable.run(args)
   e. wrap(purpose="input") calls in the app return registry values instead of calling external services
   f. wrap(purpose="output"/"state") calls capture data for evaluation
   g. Builds Evaluable from captured data
   h. Runs evaluators
4. Calls Runnable.teardown() once

Because entries run concurrently, the Runnable's run() method must be concurrency-safe. If you see sqlite3.OperationalError, "database is locked", or similar errors, add a Semaphore(1) to your Runnable (see the concurrency section in Step 2 reference).
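A minimal sketch of that serialization pattern follows. The class is a stand-in that borrows the method names the harness calls (`create()` and `run()`); it does not import pixie, and whether your real Runnable's methods are coroutines should be checked against the Step 2 reference. The counters exist only to show that no two `run()` bodies overlap.

```python
import asyncio


class SerializedRunnable:
    """Stand-in Runnable that lets only one run() body execute at a time."""

    def __init__(self) -> None:
        self._lock = asyncio.Semaphore(1)
        self.active = 0
        self.max_active = 0  # highest number of overlapping run() bodies seen

    @classmethod
    def create(cls) -> "SerializedRunnable":
        return cls()

    async def run(self, args: dict) -> None:
        async with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            await asyncio.sleep(0.01)  # placeholder for work that is not concurrency-safe
            self.active -= 1


async def demo() -> int:
    """Schedule four concurrent runs and report the peak overlap."""
    runnable = SerializedRunnable.create()
    await asyncio.gather(*(runnable.run({"i": i}) for i in range(4)))
    return runnable.max_active
```

The trade-off is throughput: with the semaphore in place, entries run one after another even though the harness schedules up to four at once.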

5b. Fix mechanical issues only

This step is strictly about fixing what you built in previous steps — the dataset, the runnable, and any custom evaluators. You are fixing mechanical problems that prevent the pipeline from running, NOT assessing or improving the application's output quality.

What counts as a mechanical issue (fix these):

| Error | Cause | Fix |
| ----- | ----- | --- |
| `WrapRegistryMissError: name='<key>'` | Dataset entry missing an `eval_input` item with the `name` that the app's `wrap(purpose="input", name="<key>")` expects | Add the missing `{"name": "<key>", "value": ...}` to `eval_input` in every affected entry |
| `WrapTypeMismatchError` | Deserialized type doesn't match what the app expects | Fix the value in the dataset |
| Runnable resolution failure | `runnable` path or class name is wrong, or the class doesn't implement the `Runnable` protocol | Fix `filepath:ClassName` in the dataset; ensure the class has `create()` and `run()` methods |
| Import error | Module path or syntax error in runnable/evaluator | Fix the referenced file |
| `ModuleNotFoundError: pixie_qa` | `pixie_qa/` directory missing `__init__.py` | Run `pixie init` to recreate it |
| `TypeError: ... is not callable` | Evaluator name points to a non-callable attribute | Evaluators must be functions, classes, or callable instances |
| `sqlite3.OperationalError` | Concurrent `run()` calls sharing a SQLite connection | Add `asyncio.Semaphore(1)` to the Runnable (see Step 2 concurrency section) |
| Custom evaluator crashes | Bug in your custom evaluator implementation | Fix the evaluator code |

What is NOT a mechanical issue (do NOT fix these here):

Application produces wrong/low-quality output → that's the application's behavior, analyzed in Step 6
Evaluator scores are low → that's a quality signal, analyzed in Step 6
LLM calls fail inside the application → report in Step 6, do not mock or work around
Evaluator scores fluctuate between runs → normal LLM non-determinism, not a bug

Iterate — fix errors, re-run, fix the next error — until pixie test runs to completion with real evaluator scores for all entries.

Output

After pixie test completes successfully, results are stored in the per-entry directory structure:

{PIXIE_ROOT}/results/<test_id>/
  meta.json                           # test run metadata
  dataset-{idx}/
    metadata.json                     # dataset name, path, runnable
    entry-{idx}/
      config.json                     # evaluators, description, expectation
      eval-input.jsonl                # input data fed to evaluators
      eval-output.jsonl               # output data captured from app
      evaluations.jsonl               # evaluation results (scored + pending)
      trace.jsonl                     # LLM call traces (if captured)

The <test_id> is printed in console output. You will reference this directory in Step 6.

---

If you hit an unexpected error when running tests (wrong parameter names, import failures, API mismatch), read wrap-api.md, evaluators.md, or testing-api.md for the authoritative API reference before guessing at a fix.

Step 6: Analyze Outcomes

Why this step: pixie test produced raw scores. Now you analyze those results to understand what they mean — completing pending evaluations, identifying patterns, validating hypotheses, and producing an actionable improvement plan. The analysis is structured in three phases that build on each other: entry-level → dataset-level → action plan.

---

Result directory structure

After pixie test, the result directory looks like:

{PIXIE_ROOT}/results/<test_id>/
  meta.json
  dataset-{idx}/
    metadata.json
    entry-{idx}/
      config.json              # evaluators, description, expectation
      eval-input.jsonl         # input data fed to evaluators
      eval-output.jsonl        # output data captured from app
      evaluations.jsonl        # scored + pending evaluations
      trace.jsonl              # LLM call traces

Read meta.json to find the <test_id>. All the data you need for analysis is in this directory.

---

Hard completion gate

You are the grader for Step 6. Pending evaluations are not a handoff to the user, and the web UI is not a substitute for grading. You may use the web UI to browse traces and outputs, but completion happens by writing files on disk.

Step 6 is incomplete until all of the following are true:

Every "status": "pending" entry in every evaluations.jsonl has been replaced with a scored entry that contains both score and reasoning.
Every dataset directory contains analysis.md and analysis-summary.md.
The test run root contains action-plan.md and action-plan-summary.md.
The verifier script in this skill's resources/ directory passes for the target results directory.

Forbidden shortcuts:

Leaving any "status": "pending" entries in place
Telling the user to review pending evaluations in the web UI
Writing a single top-level substitute file such as pixie_qa/06-analysis.md
Writing phrases like "likely passes" or "probably fails" without scoring the evaluation and updating evaluations.jsonl

If you do any of the above, Step 6 is not done.
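Before declaring the step complete, the first three conditions can be checked with a short script. This is a sketch only and does not replace the verifier script in resources/. It assumes the directory layout shown above and that each line of evaluations.jsonl is one JSON object. The function name `find_incomplete` is hypothetical.

```python
import json
from pathlib import Path


def find_incomplete(results_dir: str) -> list:
    """List pending evaluations and missing analysis files under one test run."""
    root = Path(results_dir)
    problems = []
    # Condition 1: no evaluation row may still be pending.
    for eval_file in sorted(root.glob("dataset-*/entry-*/evaluations.jsonl")):
        lines = eval_file.read_text(encoding="utf-8").splitlines()
        for line_no, line in enumerate(lines, 1):
            if line.strip() and json.loads(line).get("status") == "pending":
                rel = eval_file.relative_to(root).as_posix()
                problems.append(f"{rel}:{line_no} is pending")
    # Condition 2: every dataset directory has both analysis files.
    for dataset_dir in sorted(root.glob("dataset-*")):
        if not dataset_dir.is_dir():
            continue
        for name in ("analysis.md", "analysis-summary.md"):
            if not (dataset_dir / name).exists():
                problems.append(f"{dataset_dir.name}/{name} is missing")
    # Condition 3: the run root has both action plan files.
    for name in ("action-plan.md", "action-plan-summary.md"):
        if not (root / name).exists():
            problems.append(f"{name} is missing")
    return problems
```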

Iteration rule

If you are iterating across multiple fix/test cycles, every successful pixie test run creates a new pixie_qa/results/<test_id> directory and a new Step 6 obligation. The moment that directory exists, it becomes the analysis target for the current cycle.

Before you edit application code, prompts, datasets, evaluators, or rerun pixie test, complete Step 6 for that exact results directory. Do not skip earlier cycles and analyze only the last run.

Additional forbidden shortcut:

Do not create a newer pixie_qa/results/<test_id> and leave an older one from the same task without Step 6 artifacts.

---

Writing principles

Every detailed analysis artifact you produce must follow these principles:

Data-driven: Every opinion or statement must be backed by concrete data from the evaluation run. Quote scores, cite entry indices, reference specific eval input/output content. No hand-waving. It is better to write nothing than to write something unsubstantiated.
Evidence-first: Present the raw data and evidence before drawing conclusions. The reader (another coding agent) should be able to independently verify your conclusions from the evidence you cite.
Traceable: For every conclusion, provide the chain: data source → observation → reasoning → conclusion. Another agent should be able to follow this chain backward to verify or challenge any claim.
No selling: Do not advocate, promote, or use value-laden language ("excellent", "robust", "impressive", "well-designed"). State what the data shows and what actions it implies. Let the reader form quality judgments.
Action-oriented: Every analysis should contribute to the end goal of concrete improvements to the evaluation pipeline or application. Do not write observations that don't lead somewhere.

Every persisted analysis summary artifact must follow these principles:

Concise: The human reader should be able to understand the key findings and actions in under 2 minutes for any single artifact.
Conclusions-first: Lead with what the reader needs to know (results, findings, actions), not with methodology or background.
Plain language: Avoid jargon. A non-technical stakeholder should be able to follow the summary.
Consistent: Summary conclusions must match the detailed version's evidence. Never add claims in the summary that aren't supported in the detailed version.

Dual-variant pattern

Every persisted analysis artifact in this step has two files:

| Artifact | Detailed file (for agent) | Summary file (for human) |
| -------- | ------------------------- | ------------------------ |
| Dataset analysis | `dataset-{idx}/analysis.md` | `dataset-{idx}/analysis-summary.md` |
| Action plan | `action-plan.md` | `action-plan-summary.md` |

Always write the detailed version first, then derive the summary from it. The summary is a strict subset of the detailed version's content — it should never contain claims or conclusions not present in the detailed version.

---

Phase 1: Entry-level grading pass

Process each dataset entry individually. For each dataset-{idx}/entry-{idx}/:

1a. Read the entry data

Read these files for the entry:

config.json — what evaluators were configured, the description, the expectation
eval-input.jsonl — what data was fed to the app/evaluators
eval-output.jsonl — what the app produced
evaluations.jsonl — current evaluation results (scored and pending)
trace.jsonl — what LLM calls the app made (if available)

1b. Complete pending evaluations

If evaluations.jsonl contains entries with "status": "pending", you must grade them:

1. Read the criteria field of the pending evaluation.
2. Apply the criteria to the entry's eval input, eval output, and trace data.
3. Assign a score between 0.0 and 1.0:

1.0 — fully meets the criteria
0.5–0.9 — partially meets criteria (explain what's missing)
0.0–0.4 — does not meet criteria

4. Write a reasoning string (1–3 sentences citing specific evidence from the output or trace).
5. Replace the pending entry in evaluations.jsonl with the scored result. Do not append a second row and leave the pending row in place; overwrite the pending row itself.

Before (pending):

{
  "evaluator": "ResponseQuality",
  "status": "pending",
  "criteria": "The response should..."
}

After (scored):

{
  "evaluator": "ResponseQuality",
  "score": 0.85,
  "reasoning": "Response addresses the main question but omits..."
}
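
The in-place replacement can be sketched in a few lines of Python. This is a minimal illustration, not part of pixie: it assumes evaluations.jsonl holds one JSON object per line with the field names shown in the Before/After example, and `resolve_pending` is a hypothetical helper name. Real rows may carry additional fields that should be preserved.

```python
import json
from pathlib import Path


def resolve_pending(path: Path, evaluator: str, score: float, reasoning: str) -> int:
    """Overwrite pending rows for `evaluator` in place; return how many were replaced."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    rows = [json.loads(ln) for ln in lines]
    replaced = 0
    for i, row in enumerate(rows):
        if row.get("evaluator") == evaluator and row.get("status") == "pending":
            # Same position, new content: no pending row is left behind.
            rows[i] = {"evaluator": evaluator, "score": score, "reasoning": reasoning}
            replaced += 1
    path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    return replaced
```

The row count is unchanged after the call, which is a quick way to confirm that nothing was appended.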

Grading guidelines:

Be evidence-based — every score must reference specific output or trace content
Use the criteria literally — do not expand or reinterpret beyond what's written
Consider the trace — distinguish between app logic problems and LLM quality issues
Be calibrated — reserve 1.0 for outputs that genuinely satisfy criteria fully
Do not penalize LLM non-determinism — different phrasing of a correct answer is not a failure
Do not defer to the user — if the evidence is sufficient to write "likely passes", it is sufficient to assign a score and update evaluations.jsonl

1c. Do not persist entry-level analysis files

In this trimmed workflow, do not write `entry-{idx}/analysis.md` or `entry-{idx}/analysis-summary.md`. Phase 1 is only for reading evidence and converting every pending evaluation into a scored row in evaluations.jsonl.

You may take temporary scratch notes while reasoning, but they are not deliverables. Persist only:

updated evaluations.jsonl in each entry directory
dataset-level analysis files in Phase 2
run-level action plan files in Phase 3

---

Phase 2: Dataset-level analysis

After all entries in a dataset are graded, produce the dataset-level analysis. Write analysis.md and analysis-summary.md in the dataset directory (dataset-{idx}/).

2a. Aggregate the data

Summarize across all entries in the dataset:

Pass/fail counts and overall pass rate
Per-evaluator statistics (pass rate, min/max/mean scores)
Which entries failed which evaluators (failure clusters)
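
A rough sketch of this aggregation follows. It assumes the scores have already been loaded into a plain mapping of entry id to evaluator scores. The `pass_threshold` parameter and the rule that an entry passes only when all of its evaluators pass are assumptions made for illustration, since the pass criterion is not defined here.

```python
from collections import defaultdict
from statistics import mean


def aggregate(entries: dict[str, dict[str, float]], pass_threshold: float) -> dict:
    """Summarize scores; `entries` maps entry id -> {evaluator name: score}."""
    by_evaluator: dict[str, list[float]] = defaultdict(list)
    for scores in entries.values():
        for name, score in scores.items():
            by_evaluator[name].append(score)

    per_evaluator = {
        name: {
            "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
        }
        for name, scores in by_evaluator.items()
    }
    # Assumption: an entry passes only if every one of its evaluators passes.
    passed = [
        entry_id
        for entry_id, scores in entries.items()
        if all(s >= pass_threshold for s in scores.values())
    ]
    return {
        "passed": len(passed),
        "failed": len(entries) - len(passed),
        "pass_rate": len(passed) / len(entries) if entries else 0.0,
        "per_evaluator": per_evaluator,
    }
```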

2b. Form and validate hypotheses

Form exactly 3 high-confidence hypotheses, one for each of these dimensions:

1. Test cases quality — Does the set of test cases sufficiently and efficiently verify the application's capabilities? Does it cover the important failure modes? Are there blind spots?

2. Evaluation criteria/evaluator quality — Do the evaluators have proper granularity and grading to catch real issues? Are there rubber-stamp evaluators (all 1.0)? Are there flaky evaluators (high variance without code changes)? Are criteria too vague or too strict?

3. Application quality — Based on the evaluation results, what are the application's strengths and weaknesses? Where does it produce high-quality output? Where does it fail?
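
The two evaluator warning signs named in dimension 2 can be checked mechanically before forming a hypothesis. The sketch below is illustrative only: `rubber_stamps` and `flaky` are hypothetical helper names, `max_stdev` is an assumed cut-off, and the flakiness check presumes you have scores for the same entry from repeated runs of unchanged code.

```python
from statistics import pstdev


def rubber_stamps(scores_by_evaluator: dict[str, list[float]]) -> list[str]:
    """Evaluators that gave 1.0 to every entry they scored."""
    return sorted(
        name
        for name, scores in scores_by_evaluator.items()
        if scores and all(s == 1.0 for s in scores)
    )


def flaky(repeat_scores: dict[str, list[float]], max_stdev: float) -> list[str]:
    """Evaluators whose scores for the SAME entry vary across repeated runs.

    `repeat_scores` maps evaluator name -> scores from repeated runs of
    unchanged code on one entry; `max_stdev` is an assumed tolerance.
    """
    return sorted(
        name
        for name, scores in repeat_scores.items()
        if len(scores) > 1 and pstdev(scores) > max_stdev
    )
```

A hit from either function is evidence to cite, not a conclusion; the hypothesis still needs validation against the actual outputs.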

For each hypothesis:

State the hypothesis clearly in one sentence
Cite the evidence — entry indices, evaluator names, scores, reasoning quotes, trace data
Validate or invalidate — look at the actual eval input/output data and code to confirm or refute
Conclusion — what action does this hypothesis imply?

It is always possible to produce 3 hypotheses even when the data is limited. If the evaluation data doesn't give a conclusive answer on application quality, that itself is a signal about test case or evaluator gaps.

2c. Write the dataset analysis (two files)

Produce two files for the dataset analysis. Write the detailed version first, then derive the summary.

Detailed version: `dataset-{idx}/analysis.md`

This file is for agent consumption — it provides the complete data aggregation, hypothesis formation with evidence chains, and validated conclusions that a coding agent can act on directly.

Writing principles:

Show all the data before interpreting it. Start with the raw aggregation (pass/fail, per-evaluator stats, failure clusters) before any hypotheses. The data should stand on its own.
For each hypothesis, present: data → reasoning → conclusion. The reader should be able to follow your logic step by step and arrive at the same conclusion independently.
Cross-reference raw entry evidence directly. When citing evidence, reference the specific entry index and the underlying files/data points (for example: entry-3/evaluations.jsonl, entry-3/eval-output.jsonl, or entry-3/trace.jsonl).
Distinguish correlation from causation. If two entries fail the same evaluator, that's a pattern. But the root cause might differ — verify by checking the actual output data, don't assume.
Do not speculate without marking it. If a conclusion is uncertain, say "Hypothesis (unvalidated): ..." and explain what additional data would confirm or refute it.

Content:

1. Overview: dataset name, entry count, overall pass rate
2. Raw aggregation data

Per-evaluator statistics table (pass rate, score range, mean, standard deviation)
Failure matrix: entries × evaluators showing scores, highlighting failures
Failure clusters: entries grouped by shared failed evaluators

3. Hypothesis 1: Test cases. Hypothesis statement, evidence with entry/evaluator references, validation steps taken, conclusion with specific action
4. Hypothesis 2: Evaluators. Same structure
5. Hypothesis 3: Application. Same structure
6. Open questions: anything the data doesn't conclusively answer, with suggestions for what additional data would help
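
The failure matrix and failure clusters in item 2 could be produced along these lines. This is a sketch under assumptions: scores are held in a plain mapping of entry id to evaluator scores, `pass_threshold` is a hypothetical cut-off for what counts as a failure, and bold text is just one possible way to highlight failing cells.

```python
from collections import defaultdict


def failure_clusters(
    entries: dict[str, dict[str, float]], pass_threshold: float
) -> dict[tuple[str, ...], list[str]]:
    """Group entry ids by the exact set of evaluators they failed."""
    clusters: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for entry_id, scores in entries.items():
        failed = tuple(sorted(n for n, s in scores.items() if s < pass_threshold))
        if failed:
            clusters[failed].append(entry_id)
    return dict(clusters)


def failure_matrix(entries: dict[str, dict[str, float]], pass_threshold: float) -> str:
    """Render an entries x evaluators markdown table; failing scores are bold."""
    evaluators = sorted({n for scores in entries.values() for n in scores})
    lines = [
        "| Entry | " + " | ".join(evaluators) + " |",
        "| --- | " + " | ".join("---" for _ in evaluators) + " |",
    ]
    for entry_id, scores in entries.items():
        cells = []
        for name in evaluators:
            if name not in scores:
                cells.append("n/a")  # evaluator not configured for this entry
            elif scores[name] < pass_threshold:
                cells.append(f"**{scores[name]:.2f}**")
            else:
                cells.append(f"{scores[name]:.2f}")
        lines.append(f"| {entry_id} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```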

Summary version: `dataset-{idx}/analysis-summary.md`

This file is for human review — a scannable overview of the dataset results, key findings, and recommended actions.

Template:

# Dataset Analysis — Summary

**Dataset**: <name> | **Entries**: <N> | **Pass rate**: <X/N (Y%)>

## Results at a glance

| Evaluator | Pass rate | Avg score | Notes                  |
| --------- | --------- | --------- | ---------------------- |
| ...       | ...       | ...       | <one-liner if notable> |

## Key findings

1. <Finding>: <1-2 sentences with the conclusion and its implication>
2. ...
3. ...

## Recommended actions (priority order)

1. <Action>: <what to do and expected impact, 1-2 sentences>
2. ...
3. ...

Maximum ~40 lines for the summary.

---

Phase 3: Action plan (two files)

After all datasets are analyzed, produce the action plan. Write two files at the test run root. Write the detailed version first, then derive the summary.

Detailed version: `{PIXIE_ROOT}/results/<test_id>/action-plan.md`

This file is for agent consumption — it provides specific, implementable improvement items with full evidence trails, so a coding agent can pick up any item and execute it without additional context-gathering.

Writing principles:

Each item must be self-contained. A coding agent reading just one priority item should have enough context (evidence references, file paths, expected changes) to implement it.
Trace every item back to evidence. Each priority must reference: which hypothesis (from which dataset analysis), which entries/evaluators provided the evidence, and what the specific data showed.
Be concrete about "How". Don't say "improve the prompt" — say "In scrapegraphai/prompts/generate_answer.py line 45, add instruction: '...'". The more specific, the more actionable.
Do not include speculative items. Every item must have validated evidence. If an item is based on an unvalidated hypothesis, either validate it first or exclude it.

Structure:

# Action Plan (Detailed)

## Summary

- X datasets analyzed, Y total entries, Z% overall pass rate
- [1-2 sentence high-level assessment]

## Priority 1: [Most impactful improvement]

- **What**: [specific change to make]
- **Why**: [which hypothesis from which dataset analysis, with entry/evaluator references]
- **Evidence**: [specific scores, output excerpts, trace data that support this]
- **Expected impact**: [which entries/evaluators this will improve, and predicted score change]
- **How**: [concrete implementation steps with file paths and line numbers]
- **Verification**: [how to verify the fix worked — which entries to re-run, what scores to expect]

## Priority 2: ...

...

Summary version: `{PIXIE_ROOT}/results/<test_id>/action-plan-summary.md`

This file is for human review — a prioritized list of improvements that a human can understand and approve in under 2 minutes.

Template:

# Action Plan — Summary

**Overall**: <X entries, Y% pass rate. 1-sentence assessment.>

## Actions (priority order)

1. **<Action title>**: <What to change and why, 2-3 sentences. Expected impact.>
2. **<Action title>**: <What to change and why, 2-3 sentences. Expected impact.>
3. ...

Maximum ~30 lines for the summary.

Prioritization criteria:

Systemic issues (affecting multiple entries/datasets) before isolated ones
Issues with clear, validated evidence before speculative ones
Application quality gaps before evaluator refinements before test case additions
Quick fixes before large refactors
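
One way to apply these criteria consistently is a lexicographic sort key, sketched below. The `Candidate` fields, the "more than one affected entry means systemic" rule, and the use of estimated hours as a proxy for fix size are all assumptions for illustration; treating the four criteria as strict tie-breakers in the listed order is also an interpretation, not something stated above.

```python
from dataclasses import dataclass

# Application gaps first, then evaluator refinements, then test case additions.
KIND_ORDER = {"application": 0, "evaluator": 1, "test_case": 2}


@dataclass
class Candidate:
    title: str
    affected_entries: int  # how many entries show the issue
    validated: bool        # backed by a validated hypothesis
    kind: str              # one of KIND_ORDER
    effort_hours: float    # rough size of the fix


def priority_key(c: Candidate) -> tuple:
    return (
        0 if c.affected_entries > 1 else 1,  # systemic before isolated
        0 if c.validated else 1,             # validated before speculative
        KIND_ORDER[c.kind],                  # application > evaluator > test case
        c.effort_hours,                      # quick fixes before large refactors
    )


def prioritize(candidates: list[Candidate]) -> list[Candidate]:
    return sorted(candidates, key=priority_key)
```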

The action plan should have 3–5 items. Each must trace back to a validated hypothesis from Phase 2. Do not include items that are speculative or lack evidence.

---

Process summary

1. Phase 1 (per entry): Read data → grade pending evaluations → update evaluations.jsonl
2. Phase 2 (per dataset): Aggregate → form 3 hypotheses → validate → write dataset-{idx}/analysis.md + dataset-{idx}/analysis-summary.md
3. Phase 3 (per test run): Synthesize → prioritize → write action-plan.md + action-plan-summary.md

Process entries within a dataset concurrently (using subagents if available). Process phases sequentially — Phase 2 depends on Phase 1 outputs, Phase 3 depends on Phase 2 outputs.
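
The ordering constraint can be pictured with asyncio, as in the sketch below. The function names and return values are placeholders, not part of pixie or of any subagent API; the point is only that entries fan out concurrently while each phase waits for the previous one to finish.

```python
import asyncio


async def grade_entry(entry_dir: str) -> str:
    # Placeholder for the Phase 1 work on a single entry.
    await asyncio.sleep(0)
    return f"graded {entry_dir}"


async def run_dataset(entry_dirs: list[str]) -> dict:
    # Phase 1: entries are independent, so they can be graded concurrently.
    graded = await asyncio.gather(*(grade_entry(d) for d in entry_dirs))
    # Phase 2 starts only after every entry in the dataset has been graded.
    return {"entries": list(graded), "analysis": f"analysis over {len(graded)} entries"}


async def run_all(datasets: dict[str, list[str]]) -> dict:
    analyses = {}
    for name, entry_dirs in datasets.items():
        analyses[name] = await run_dataset(entry_dirs)
    # Phase 3 needs every dataset analysis, so it runs last.
    return {"datasets": analyses, "action_plan": f"plan from {len(analyses)} dataset analyses"}
```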

---

Final verification

Before you end your turn, run the Step 6 verifier script that ships beside setup.sh in this skill's resources/ directory against the exact test run directory you analyzed.

Example shape:

python /path/to/eval-driven-dev/resources/verify_step6_completion.py pixie_qa/results/<test_id>

If the verifier reports any error, keep working. Step 6 is not complete until the verifier passes.

Built-in Evaluators

Auto-generated from pixie source code docstrings.

Do not edit by hand — run uv run python scripts/generate_skill_docs.py.

Autoevals adapters — pre-made evaluators wrapping autoevals scorers.

This module provides AutoevalsAdapter, which bridges the autoevals Scorer interface to pixie's Evaluator protocol, and a set of factory functions for common evaluation tasks.

Public API (all are also re-exported from pixie.evals):

Core adapter:

AutoevalsAdapter: generic wrapper for any autoevals Scorer.

Heuristic scorers (no LLM required):

LevenshteinMatch: edit-distance string similarity.
ExactMatch: exact value comparison.
NumericDiff: normalised numeric difference.
JSONDiff: structural JSON comparison.
ValidJSON: JSON syntax / schema validation.
ListContains: overlap between two string lists.

Embedding scorer:

EmbeddingSimilarity: cosine similarity via embeddings.

LLM-as-judge scorers:

Factuality, ClosedQA, Battle, Humor, Security, Sql, Summary, Translation, Possible.

Moderation:

Moderation: OpenAI content-moderation check.

RAGAS metrics:

ContextRelevancy, Faithfulness, AnswerRelevancy, AnswerCorrectness.

Evaluator Selection Guide

Choose evaluators based on the output type and eval criteria:

| Output type | Evaluator category | Examples |
| --- | --- | --- |
| Deterministic (labels, yes/no, fixed-format) | Heuristic: `ExactMatch`, `JSONDiff`, `ValidJSON` | Label classification, JSON extraction |
| Open-ended text with a reference answer | LLM-as-judge: `Factuality`, `ClosedQA`, `AnswerCorrectness` | Chatbot responses, QA, summaries |
| Text with expected context/grounding | RAG: `Faithfulness`, `ContextRelevancy` | RAG pipelines |
| Text with style/format requirements | Custom via `create_llm_evaluator` | Voice-friendly responses, tone checks |
| Multi-aspect quality | Multiple evaluators combined | Factuality + relevance + tone |
| Trace-dependent quality (tool use, routing) | Agent evaluator via `create_agent_evaluator` | Tool correctness, multi-step reasoning |

Critical rules:

For open-ended LLM text, never use ExactMatch: LLM outputs are non-deterministic.

AnswerRelevancy is RAG-only: it requires context in the trace and returns 0.0 without it. For general relevance, use create_llm_evaluator.

Do NOT use comparison evaluators (Factuality, ClosedQA, ExactMatch) on items without expected_output: they produce meaningless scores.
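
These rules lend themselves to a pre-flight check on the dataset. The sketch below is not a pixie feature: the two name sets are transcribed from the "Requires" lines in the Evaluator Reference, while `check_item` and the dict shape of a dataset item (`expected_output`, `eval_metadata`) are assumptions made for illustration.

```python
# Evaluators whose reference entry says: Requires expected_output: Yes.
NEEDS_EXPECTED_OUTPUT = {
    "AnswerCorrectness", "Battle", "ClosedQA", "ContextRelevancy",
    "EmbeddingSimilarity", "ExactMatch", "Factuality", "Humor", "JSONDiff",
    "LevenshteinMatch", "ListContains", "NumericDiff", "Sql", "Summary",
    "Translation",
}
# Evaluators whose reference entry says: Requires eval_metadata["context"]: Yes.
NEEDS_CONTEXT = {"AnswerRelevancy", "ContextRelevancy", "Faithfulness"}


def check_item(evaluators: list[str], item: dict) -> list[str]:
    """Return human-readable problems for one dataset item (hypothetical shape)."""
    problems = []
    has_expected = item.get("expected_output") is not None
    has_context = bool((item.get("eval_metadata") or {}).get("context"))
    for name in evaluators:
        if name in NEEDS_EXPECTED_OUTPUT and not has_expected:
            problems.append(f"{name}: item has no expected_output")
        if name in NEEDS_CONTEXT and not has_context:
            problems.append(f"{name}: item has no eval_metadata['context']")
    return problems
```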

---

Evaluator Reference

`AnswerCorrectness`

AnswerCorrectness(*, client: 'Any' = None) -> 'AutoevalsAdapter'

Answer correctness evaluator (RAGAS).

Judges whether eval_output is correct compared to expected_output, combining factual similarity and semantic similarity.

When to use: QA scenarios in RAG pipelines where you have a reference answer and want a comprehensive correctness score.

Requires `expected_output`: Yes. Requires `eval_metadata["context"]`: Optional (improves accuracy).

Args: client: OpenAI client instance.

`AnswerRelevancy`

AnswerRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'

Answer relevancy evaluator (RAGAS).

Judges whether eval_output directly addresses the question in eval_input.

When to use: RAG pipelines only — requires context in the trace. Returns 0.0 without it. For general (non-RAG) response relevance, use create_llm_evaluator with a custom prompt instead.

Requires `expected_output`: No. Requires `eval_metadata["context"]`: Yes — RAG pipelines only.

Args: client: OpenAI client instance.

`Battle`

Battle(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Head-to-head comparison evaluator (LLM-as-judge).

Uses an LLM to compare eval_output against expected_output and determine which is better given the instructions in eval_input.

When to use: A/B testing scenarios, comparing model outputs, or ranking alternative responses.

Requires `expected_output`: Yes.

Args: model: LLM model name. client: OpenAI client instance.

`ClosedQA`

ClosedQA(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Closed-book question-answering evaluator (LLM-as-judge).

Uses an LLM to judge whether eval_output correctly answers the question in eval_input compared to expected_output. Optionally forwards eval_metadata["criteria"] for custom grading criteria.

When to use: QA scenarios where the answer should match a reference — e.g. customer support answers, knowledge-base queries.

Requires `expected_output`: Yes — do NOT use on items without expected_output; produces meaningless scores.

Args: model: LLM model name. client: OpenAI client instance.

`ContextRelevancy`

ContextRelevancy(*, client: 'Any' = None) -> 'AutoevalsAdapter'

Context relevancy evaluator (RAGAS).

Judges whether the retrieved context is relevant to the query. Forwards eval_metadata["context"] to the underlying scorer.

When to use: RAG pipelines — evaluating retrieval quality.

Requires `expected_output`: Yes. Requires `eval_metadata["context"]`: Yes (RAG pipelines only).

Args: client: OpenAI client instance.

`EmbeddingSimilarity`

EmbeddingSimilarity(*, prefix: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Embedding-based semantic similarity evaluator.

Computes cosine similarity between embedding vectors of eval_output and expected_output.

When to use: Comparing semantic meaning of two texts when exact wording doesn't matter. More robust than Levenshtein for paraphrased content but less nuanced than LLM-as-judge evaluators.

Requires `expected_output`: Yes.

Args: prefix: Optional text to prepend for domain context. model: Embedding model name. client: OpenAI client instance.
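
For intuition, cosine similarity between two vectors can be computed as below. This is a generic illustration of the formula, not the adapter's implementation; in the real evaluator the vectors come from an embedding model, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # sketch-only convention for zero vectors
    return dot / (norm_a * norm_b)
```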

`ExactMatch`

ExactMatch() -> 'AutoevalsAdapter'

Exact value comparison evaluator.

Returns 1.0 if eval_output exactly equals expected_output, 0.0 otherwise.

When to use: Deterministic, structured outputs (classification labels, yes/no answers, fixed-format strings). Never use for open-ended LLM text — LLM outputs are non-deterministic, so exact match will almost always fail.

Requires `expected_output`: Yes.

`Factuality`

Factuality(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Factual accuracy evaluator (LLM-as-judge).

Uses an LLM to judge whether eval_output is factually consistent with expected_output given the eval_input context.

When to use: Open-ended text where factual correctness matters (chatbot responses, QA answers, summaries). Preferred over ExactMatch for LLM-generated text.

Requires `expected_output`: Yes — do NOT use on items without expected_output; produces meaningless scores.

Args: model: LLM model name. client: OpenAI client instance.

`Faithfulness`

Faithfulness(*, client: 'Any' = None) -> 'AutoevalsAdapter'

Faithfulness evaluator (RAGAS).

Judges whether eval_output is faithful to (i.e. supported by) the provided context. Forwards eval_metadata["context"].

When to use: RAG pipelines — ensuring the answer doesn't hallucinate beyond what the retrieved context supports.

Requires `expected_output`: No. Requires `eval_metadata["context"]`: Yes (RAG pipelines only).

Args: client: OpenAI client instance.

`Humor`

Humor(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Humor quality evaluator (LLM-as-judge).

Uses an LLM to judge the humor quality of eval_output against expected_output.

When to use: Evaluating humor in creative writing, chatbot personality, or entertainment applications.

Requires `expected_output`: Yes.

Args: model: LLM model name. client: OpenAI client instance.

`JSONDiff`

JSONDiff(*, string_scorer: 'Any' = None) -> 'AutoevalsAdapter'

Structural JSON comparison evaluator.

Recursively compares two JSON structures and produces a similarity score. Handles nested objects, arrays, and mixed types.

When to use: Structured JSON outputs where field-level comparison is needed (e.g. extracted data, API response schemas, tool call arguments).

Requires `expected_output`: Yes.

Args: string_scorer: Optional pairwise scorer for string fields.

`LevenshteinMatch`

LevenshteinMatch() -> 'AutoevalsAdapter'

Edit-distance string similarity evaluator.

Computes a normalised Levenshtein distance between eval_output and expected_output. Returns 1.0 for identical strings and decreasing scores as edit distance grows.

When to use: Deterministic or near-deterministic outputs where small textual variations are acceptable (e.g. formatting differences, minor spelling). Not suitable for open-ended LLM text — use an LLM-as-judge evaluator instead.

Requires `expected_output`: Yes.
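
To make the scoring behaviour concrete, here is a small standalone sketch of a normalised edit-distance similarity. It uses one common normalisation (dividing by the longer string's length), which is an assumption; the adapter's underlying scorer may normalise differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when characters match
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, falling toward 0.0 as edit distance grows."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```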

`ListContains`

ListContains(*, pairwise_scorer: 'Any' = None, allow_extra_entities: 'bool' = False) -> 'AutoevalsAdapter'

List overlap evaluator.

Checks whether eval_output contains all items from expected_output. Scores based on overlap ratio.

When to use: Outputs that produce a list of items where completeness matters (e.g. extracted entities, search results, recommendations).

Requires `expected_output`: Yes.

Args: pairwise_scorer: Optional scorer for pairwise element comparison. allow_extra_entities: If True, extra items in output are not penalised.

`Moderation`

Moderation(*, threshold: 'float | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Content moderation evaluator.

Uses the OpenAI moderation API to check eval_output for unsafe content (hate speech, violence, self-harm, etc.).

When to use: Any application where output safety is a concern — chatbots, content generation, user-facing AI.

Requires `expected_output`: No.

Args: threshold: Custom flagging threshold. client: OpenAI client instance.

`NumericDiff`

NumericDiff() -> 'AutoevalsAdapter'

Normalised numeric difference evaluator.

Computes a normalised numeric distance between eval_output and expected_output. Returns 1.0 for identical numbers and decreasing scores as the difference grows.

When to use: Numeric outputs where approximate equality is acceptable (e.g. price calculations, scores, measurements).

Requires `expected_output`: Yes.

`Possible`

Possible(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Feasibility / plausibility evaluator (LLM-as-judge).

Uses an LLM to judge whether eval_output is a plausible or feasible response.

When to use: General-purpose quality check when you want to verify outputs are reasonable without a specific reference answer.

Requires `expected_output`: No.

Args: model: LLM model name. client: OpenAI client instance.

`Security`

Security(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Security vulnerability evaluator (LLM-as-judge).

Uses an LLM to check eval_output for security vulnerabilities based on the instructions in eval_input.

When to use: Code generation, SQL output, or any scenario where output must be checked for injection or vulnerability risks.

Requires `expected_output`: No.

Args: model: LLM model name. client: OpenAI client instance.

`Sql`

Sql(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

SQL equivalence evaluator (LLM-as-judge).

Uses an LLM to judge whether eval_output SQL is semantically equivalent to expected_output SQL.

When to use: Text-to-SQL applications where the generated SQL should be functionally equivalent to a reference query.

Requires `expected_output`: Yes.

Args: model: LLM model name. client: OpenAI client instance.

`Summary`

Summary(*, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Summarisation quality evaluator (LLM-as-judge).

Uses an LLM to judge the quality of eval_output as a summary compared to the reference summary in expected_output.

When to use: Summarisation tasks where the output must capture key information from the source material.

Requires `expected_output`: Yes.

Args: model: LLM model name. client: OpenAI client instance.

`Translation`

Translation(*, language: 'str | None' = None, model: 'str | None' = None, client: 'Any' = None) -> 'AutoevalsAdapter'

Translation quality evaluator (LLM-as-judge).

Uses an LLM to judge the translation quality of eval_output compared to expected_output in the target language.

When to use: Machine translation or multilingual output scenarios.

Requires `expected_output`: Yes.

Args: language: Target language (e.g. "Spanish"). model: LLM model name. client: OpenAI client instance.

`ValidJSON`

ValidJSON(*, schema: 'Any' = None) -> 'AutoevalsAdapter'

JSON syntax and schema validation evaluator.

Returns 1.0 if eval_output is valid JSON (and optionally matches the provided schema), 0.0 otherwise.

When to use: Outputs that must be valid JSON — optionally conforming to a specific schema (e.g. tool call responses, structured extraction).

Requires `expected_output`: No.

Args: schema: Optional JSON Schema to validate against.
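
The syntax half of this check amounts to attempting a parse, as in the sketch below. It is a stand-in written with the standard library only, not the adapter's code, and it deliberately omits the optional schema validation.

```python
import json


def valid_json_score(text: str) -> float:
    """Return 1.0 if `text` parses as JSON, 0.0 otherwise (syntax check only)."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0
```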

---

Custom Evaluators: `create_llm_evaluator`

Factory for custom LLM-as-judge evaluators from prompt templates.

Usage:

from pixie import create_llm_evaluator

concise_voice_style = create_llm_evaluator(
    name="ConciseVoiceStyle",
    prompt_template="""
    You are evaluating whether a voice agent response is concise and
    phone-friendly.

    User said: {eval_input}
    Agent responded: {eval_output}
    Expected behavior: {expectation}

    Score 1.0 if the response is concise (under 3 sentences), directly
    addresses the question, and uses conversational language suitable for
    a phone call. Score 0.0 if it's verbose, off-topic, or uses
    written-style formatting.
    """,
)

`create_llm_evaluator`

create_llm_evaluator(name: 'str', prompt_template: 'str', *, model: 'str' = 'gpt-4o-mini', client: 'Any | None' = None) -> '_LLMEvaluator'

Create a custom LLM-as-judge evaluator from a prompt template.

The template may reference these variables (populated from the pixie.storage.evaluable.Evaluable fields):

{eval_input}: the evaluable's input data. Single-item lists expand to that item's value; multi-item lists expand to a JSON dict of name → value pairs.

{eval_output}: the evaluable's output data (same rule as eval_input).

{expectation}: the evaluable's expected output.

Args: name: Display name for the evaluator (shown in scorecard). prompt_template: A string template with {eval_input}, {eval_output}, and/or {expectation} placeholders. model: OpenAI model name (default: gpt-4o-mini). client: Optional pre-configured OpenAI client instance.

Returns: An evaluator callable satisfying the Evaluator protocol.

Raises: ValueError: If the template uses nested field access like {eval_input[key]} (only top-level placeholders are supported).
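
If you want to catch the nested-access mistake before constructing an evaluator, a template can be scanned with the standard library's string.Formatter. The helper below is a hypothetical pre-check, not pixie's own validation, and it only mirrors the documented rule that nested field access is rejected.

```python
from string import Formatter


def top_level_placeholders(template: str) -> set[str]:
    """Return placeholder names used in `template`; reject nested field access."""
    names = set()
    for _literal, field, _spec, _conversion in Formatter().parse(template):
        if field is None:
            continue  # plain text segment with no placeholder
        if "." in field or "[" in field:
            raise ValueError(f"nested field access is not supported: {field!r}")
        names.add(field)
    return names
```

For example, a template using {eval_input} and {eval_output} yields those two names, while one using {eval_input[key]} raises ValueError.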

`create_agent_evaluator`

create_agent_evaluator(name: 'str', criteria: 'str') -> '_AgentEvaluator'

Create an evaluator whose grading is deferred to a coding agent.

During pixie test, agent evaluators are not scored automatically. Instead, they raise AgentEvaluationPending and record a PendingEvaluation with the evaluation criteria. The coding agent (guided by Step 6) reviews each entry's trace and output, then grades the pending evaluations.

When to use: Quality dimensions that require holistic review of the LLM trace — tool call correctness, multi-step reasoning quality, routing decisions — where an automated LLM-as-judge prompt can't capture the nuance.

When NOT to use: Simple text quality checks (use create_llm_evaluator instead), deterministic checks (use heuristic evaluators), or any criterion that can be scored from input + output alone without trace context.

Args: name: Display name for the evaluator (shown in scorecard as ⏳ pending). criteria: What to evaluate — the grading instructions the agent will follow when reviewing results. Be specific and actionable.

Returns: An evaluator callable satisfying the Evaluator protocol. Its __call__ raises AgentEvaluationPending instead of returning an Evaluation.

Example:

from pixie import create_agent_evaluator

ResponseQuality = create_agent_evaluator(
    name="ResponseQuality",
    criteria="The response directly addresses the user's question with "
             "accurate, well-structured information. No hallucinations "
             "or off-topic content.",
)

ToolUsageCorrectness = create_agent_evaluator(
    name="ToolUsageCorrectness",
    criteria="The app called the correct tools in the right order based "
             "on the user's intent. No unnecessary or missed tool calls.",
)

Runnable Example: CLI Application

Use this pattern when the app is invoked from the command line (e.g., python -m myapp, or a CLI tool built with argparse/click).

Approach: Use asyncio.create_subprocess_exec to invoke the CLI and capture output.

# pixie_qa/run_app.py
import asyncio
import sys

from pydantic import BaseModel
import pixie


class AppArgs(BaseModel):
    query: str


class AppRunnable(pixie.Runnable[AppArgs]):
    """Drives a CLI application via subprocess."""

    @classmethod
    def create(cls) -> "AppRunnable":
        return cls()

    async def run(self, args: AppArgs) -> None:
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-m", "myapp", "--query", args.query,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=120)
        if proc.returncode != 0:
            raise RuntimeError(f"App failed (exit {proc.returncode}): {stderr.decode()}")

When the CLI needs patched dependencies

If the CLI reads from external services, create a wrapper entry point that patches dependencies before running the real CLI:

# pixie_qa/patched_app.py
"""Entry point that patches external deps before running the real CLI."""
import myapp.config as config
config.redis_url = "mock://localhost"

from myapp.main import main
main()

Then point your Runnable at the wrapper:

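async def run(self, args: AppArgs) -> None:
    # Launch the patched wrapper module instead of the real CLI entry point.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-m", "pixie_qa.patched_app", "--query", args.query,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=120)
    # Surface failures the same way as the unpatched Runnable above.
    if proc.returncode != 0:
        raise RuntimeError(f"App failed (exit {proc.returncode}): {stderr.decode()}")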

Note: For CLI apps, wrap(purpose="input") injection only works when the app runs in the same process. If using subprocess, you may need to pass test data via environment variables or config files instead.
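
Passing test data through the environment might look like the sketch below. The variable name MYAPP_TEST_PROFILE and the inline child script are hypothetical stand-ins so the example runs on its own; a real app would read whatever variable or config file its own code expects.

```python
import asyncio
import os
import sys

# Stand-in for the real CLI: a child process that reads test data from its environment.
CHILD_CODE = "import os; print(os.environ['MYAPP_TEST_PROFILE'])"


async def run_with_env(test_profile: str) -> str:
    # Copy the parent environment and add the test-controlled value.
    env = {**os.environ, "MYAPP_TEST_PROFILE": test_profile}
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE,
        env=env,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=120)
    if proc.returncode != 0:
        raise RuntimeError(f"Child failed (exit {proc.returncode}): {stderr.decode()}")
    return stdout.decode().strip()
```

Copying os.environ first keeps PATH and any API keys available to the child, which matters because the app still has to reach a real model.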


How it compares

Pick eval-driven-dev over generic testing skills when LLM or agent eval design must reflect real product behavior before writing instrumentation.

FAQ

Can eval-driven-dev mock the LLM during tests?

No. The app's LLM calls must hit a real model; mocking makes eval scores tautological and is explicitly forbidden in the Runnable.

What is the final deliverable?

A completed pixie test run with real evaluator scores, pending evaluations resolved, and action-plan.md plus analysis summaries.

Which Python version does eval-driven-dev require?

Python 3.10 or newer with pixie-qa version 0.8.4 or newer installed via the skill setup script.

Is Eval Driven Dev safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
