Promptfoo Evaluation

Name: Promptfoo Evaluation
Author: daymade

daymade/claude-code-skills

Configure and run Promptfoo evaluations—providers, assertions, and zero-cost echo runs—so you can regression-test prompts before shipping agent features.

Overview

Promptfoo Evaluation is an agent skill that documents Promptfoo provider and assertion configuration for regression-testing LLM prompts. It is most often used in the Ship phase, and also in Build.

Install

npx skills add https://github.com/daymade/claude-code-skills --skill promptfoo-evaluation

What is this skill?

Echo provider returns rendered prompts with no API calls and zero token cost
Ready-made Anthropic and OpenAI provider YAML with temperature and max_tokens
Multi-provider A/B labels for side-by-side model comparison
Python AssertionContext fields for custom pass/fail logic on LLM outputs
Documented use cases: debug variables, verify few-shot structure, dry-run config

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 586 installs on skills.sh; 1.2k GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

You changed a system prompt or few-shot block and have no cheap, repeatable way to preview renders and assert outputs across models.

Who is it for?

Indie builders maintaining agent skills or chat features who already use or want Promptfoo as the eval harness.

Skip if: you do not version your prompts, or you need full application E2E tests unrelated to LLM I/O.

When should I use this skill?

When configuring Promptfoo providers, assertions, or zero-cost echo preview runs for prompt regression testing.

What do I get? / Deliverables

You get working Promptfoo YAML patterns, echo dry-runs, and assertion hooks so eval suites can run before merge or release.

Provider and assertion YAML snippets
Eval-ready prompt preview and test case patterns


Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Prompt regression and provider A/B checks belong on the Ship shelf as testing work you do before trusting prompts in production. The Testing subphase fits because the skill documents eval providers, assertions, and preview runs rather than building UI or deploying infrastructure.

Also useful

Build: Agent skills & templates

Validate: Prototype & spike

Where it fits

Ship: Testing & QA

Run echo provider on nightly prompt config to catch broken variable substitution before release.

Build: Agent skills & templates

Add labeled Claude vs GPT providers to compare tool-calling instructions while authoring a skill.

Validate: Prototype & spike

Smoke-test a landing-page chatbot prompt matrix with assertions before committing to full build.

Grow: Content & marketing

Re-run promptfoo suite after copy changes to lifecycle emails generated by an LLM template.

How it compares

Skill reference for Promptfoo configs—not a hosted eval platform or a replacement for unit tests on non-LLM code.

Common Questions / FAQ

Who is Promptfoo Evaluation for?

Solo developers and small teams building Claude- or GPT-backed agents who want copy-paste provider and assertion recipes for Promptfoo.

When should I use Promptfoo Evaluation?

During Ship testing before release, and during Build agent-tooling when designing prompt suites, echo previews, or multi-model A/B configs.

Is Promptfoo Evaluation safe to install?

The skill is documentation-heavy; any real evals use your API keys—check the Security Audits panel on this page and scan configs before committing secrets.

SKILL.md


Security scan passed
Scanned at: 2026-03-02T20:00:16.607484
Tool: gitleaks + pattern-based validation
Content hash: 058a48a82477727772269754ab2bae5bb1f575fc264a1e28f1a2cfad25656b95


# Promptfoo API Reference



## Provider Configuration



### Echo Provider (No API Calls)



```yaml
providers:
  - echo  # Returns prompt as-is, no API calls
```



**Use cases:**

- Preview rendered prompts without cost

- Debug variable substitution

- Verify few-shot structure

- Test configuration before production runs



**Cost:** Free - no tokens consumed.
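
The variable-substitution use case can be pictured with a small Python sketch. It is not Promptfoo's implementation: `render_prompt` and `unresolved_placeholders` are hypothetical helpers, and the regex substitution only stands in for the template rendering that an echo run lets you inspect. The sketch leaves unknown names in place so they are easy to spot.

```python
import re

# Hypothetical stand-in for template rendering; matches {{ name }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders, leaving unknown names untouched."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


def unresolved_placeholders(rendered: str) -> list:
    """Return the placeholder names that survived rendering."""
    return PLACEHOLDER.findall(rendered)


template = "Translate to {{language}}: {{text}}"
rendered = render_prompt(template, {"language": "French"})  # "text" is missing
missing = unresolved_placeholders(rendered)
print(rendered)
print(missing)
```

A non-empty `missing` list is the kind of problem an echo dry-run is meant to surface before any paid provider is called.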



### Anthropic



```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096
      temperature: 0.7
      # For relay/proxy APIs:
      # apiBaseUrl: https://your-relay.example.com/api
```



### OpenAI



```yaml
providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.5
      max_tokens: 2048
```



### Multiple Providers (A/B Testing)



```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1
```



## Assertion Reference



### Python Assertion Context



```python
from typing import Any


class AssertionContext:
    prompt: str              # Raw prompt sent to LLM
    vars: dict               # Test case variables
    test: dict               # Complete test case
    config: dict             # Assertion config
    provider: Any            # Provider info
    providerResponse: Any    # Full response
```
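
A custom assertion built on these fields might look like the sketch below. It assumes the entry point is a function named `get_assert` that receives the model output and the context as a plain dict keyed by the fields above. The variable name `expected_keyword` is hypothetical and chosen only for illustration.

```python
from typing import Any, Dict


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the output mentions the keyword named in the test vars."""
    # "expected_keyword" is a hypothetical variable name, not a Promptfoo field.
    expected = str(context.get("vars", {}).get("expected_keyword", ""))
    found = bool(expected) and expected.lower() in output.lower()
    return {
        "pass": found,
        "score": 1.0 if found else 0.0,
        "reason": f"keyword {expected!r} {'found' if found else 'missing'}",
    }


# Local smoke test with a hand-built context dict
result = get_assert("Bonjour le monde", {"vars": {"expected_keyword": "bonjour"}})
print(result)
```

Calling the function directly with a hand-built context, as in the last lines, is a cheap way to debug the logic before wiring it into an eval run.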



### GradingResult Format



```python
{
    "pass": bool,           # Required: pass/fail
    "score": float,         # 0.0-1.0 score
    "reason": str,          # Explanation
    "named_scores": dict,   # Custom metrics
    "component_results": [] # Nested results
}
```
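
To show how the nested fields could fit together, here is a minimal sketch that merges several per-check results into one dict of the shape above. The helper `combine_results` is hypothetical, and its aggregation policy (pass only if every check passes, score as the plain mean) is an assumption rather than something Promptfoo prescribes.

```python
from typing import Any, Dict, List


def combine_results(components: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-check results into one GradingResult-shaped dict."""
    if not components:
        return {"pass": False, "score": 0.0, "reason": "no checks ran"}
    passed = all(c["pass"] for c in components)
    score = sum(c["score"] for c in components) / len(components)
    failed = [c["reason"] for c in components if not c["pass"]]
    return {
        "pass": passed,
        "score": score,
        "reason": "all checks passed" if passed else "; ".join(failed),
        # One named score per check, keyed by position
        "named_scores": {f"check_{i}": c["score"] for i, c in enumerate(components)},
        "component_results": components,
    }


checks = [
    {"pass": True, "score": 1.0, "reason": "has greeting"},
    {"pass": False, "score": 0.0, "reason": "missing sign-off"},
]
combined = combine_results(checks)
print(combined["pass"], combined["score"], combined["reason"])
```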



### Assertion Types



| Type | Description | Parameters |
|------|-------------|------------|
| `contains` | Substring check | `value` |
| `icontains` | Case-insensitive | `value` |
| `equals` | Exact match | `value` |
| `regex` | Pattern match | `value` |
| `not-contains` | Absence check | `value` |
| `starts-with` | Prefix check | `value` |
| `contains-any` | Any substring | `value` (array) |
| `contains-all` | All substrings | `value` (array) |
| `cost` | Token cost | `threshold` |
| `latency` | Response time | `threshold` (ms) |
| `perplexity` | Model confidence | `threshold` |
| `python` | Custom Python | `value` (file/code) |
| `javascript` | Custom JS | `value` (code) |
| `llm-rubric` | LLM grading | `value`, `threshold` |
| `factuality` | Fact checking | `value` (reference) |
| `model-graded-closedqa` | Q&A grading | `value` |
| `similar` | Semantic similarity | `value`, `threshold` |
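
The string-matching rows of the table can be read as simple predicates. The sketch below re-implements them in Python purely to illustrate the descriptions in the table. It is not Promptfoo's code, the `CHECKS` mapping and `run_check` helper are hypothetical, and edge cases (whitespace handling, regex flags) may differ from the real assertions.

```python
import re
from typing import Any, Callable, Dict

# Illustrative predicates keyed by assertion type; each takes (output, value).
CHECKS: Dict[str, Callable[[str, Any], bool]] = {
    "contains": lambda out, v: str(v) in out,
    "icontains": lambda out, v: str(v).lower() in out.lower(),
    "equals": lambda out, v: out == str(v),
    "regex": lambda out, v: re.search(str(v), out) is not None,
    "not-contains": lambda out, v: str(v) not in out,
    "starts-with": lambda out, v: out.startswith(str(v)),
    "contains-any": lambda out, v: any(str(s) in out for s in v),
    "contains-all": lambda out, v: all(str(s) in out for s in v),
}


def run_check(kind: str, output: str, value: Any) -> bool:
    """Evaluate one string assertion against a model output."""
    return CHECKS[kind](output, value)


output = "Hello World"
print(run_check("icontains", output, "hello"))
print(run_check("contains-all", output, ["Hello", "Mars"]))
```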



## Test Case Configuration



### Full Test Case Structure



```yaml
- description: "Test name"
  vars:
    var1: "value"
    var2: file://path.txt
  assert:
    - type: contains
      value: "expected"
  metadata:
    category: "test-category"
    priority: high
  options:
    provider: specific-provider
    transform: "output.trim()"
```



### Loading Variables from Files



```yaml
vars:
  # Text file (loaded as string)
  content: file://data/input.txt

  # JSON/YAML (parsed to object)
  config: file://config.json

  # Python script (executed, returns value)
  dynamic: file://scripts/generate.py

  # PDF (text extracted)
  document: file://docs/report.pdf

  # Image (base64 encoded)
  image: file://images/photo.png
```
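
For the Python script case, a variable script could look roughly like the following. This is a sketch under assumptions: it supposes the script exposes a function named `get_var` that receives the variable name, the prompt, and the other test variables, and returns a dict with an `output` key (or an `error` key on failure). The `topic` variable is hypothetical.

```python
from typing import Any, Dict


def get_var(var_name: str, prompt: str, other_vars: Dict[str, Any]) -> Dict[str, str]:
    """Build a variable value from the other variables in the test case."""
    topic = other_vars.get("topic")  # "topic" is a hypothetical variable name
    if not topic:
        return {"error": f"cannot build {var_name!r}: 'topic' is not set"}
    return {"output": f"Background notes about {topic}"}


# Local check outside of any eval run
print(get_var("dynamic", "Summarize: {{dynamic}}", {"topic": "solar power"}))
```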



## Advanced Patterns



### Dynamic Test Generation (Python)



```python
# tests/generate.py
def get_tests():
    return [
        {
            "vars": {"input": f"test {i}"},
            "assert": [{"type": "contains", "value": str(i)}]
        }
        for i in range(10)
    ]
```



```yaml
tests: file://tests/generate.py:get_tests
```
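
Because generated tests are ordinary Python data, their shape can be checked locally before an eval run. The sketch below repeats the generator so it runs on its own. The `find_problems` helper is hypothetical and checks only the two keys used in this section.

```python
from typing import Any, Dict, List


def get_tests() -> List[Dict[str, Any]]:
    # Same generator as in this section, repeated so the sketch is self-contained.
    return [
        {"vars": {"input": f"test {i}"}, "assert": [{"type": "contains", "value": str(i)}]}
        for i in range(10)
    ]


def find_problems(tests: List[Dict[str, Any]]) -> List[str]:
    """List structural problems; an empty list means the shape looks right."""
    problems = []
    for index, test in enumerate(tests):
        if not isinstance(test.get("vars"), dict):
            problems.append(f"test {index}: 'vars' must be a dict")
        for assertion in test.get("assert", []):
            if "type" not in assertion:
                problems.append(f"test {index}: assertion without 'type'")
    return problems


print(len(get_tests()), find_problems(get_tests()))
```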



### Scenario-based Testing



```yaml
scenarios:
  - config:
      - vars:
          language: "French"
      - vars:
          language: "Spanish"
    test
```
