
Promptfoo Evaluation
Configure and run Promptfoo evaluations—providers, assertions, and zero-cost echo runs—so you can regression-test prompts before shipping agent features.
Overview
Promptfoo Evaluation is an agent skill, most often used in the Ship phase (and also in Build), that documents Promptfoo provider and assertion configuration for regression-testing LLM prompts.
Install
npx skills add https://github.com/daymade/claude-code-skills --skill promptfoo-evaluation
What is this skill?
- Echo provider returns rendered prompts with no API calls and zero token cost
- Ready-made Anthropic and OpenAI provider YAML with temperature and max_tokens
- Multi-provider A/B labels for side-by-side model comparison
- Python AssertionContext fields for custom pass/fail logic on LLM outputs
- Documented use cases: debug variables, verify few-shot structure, dry-run config
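The custom pass/fail logic mentioned in the list above can be sketched as a Python assertion file. This is a minimal sketch, assuming Promptfoo calls a function named `get_assert` with the model output and a context that supports dict-style access to `vars`. The variable name `expected_keyword` is hypothetical and chosen only for illustration.

```python
from typing import Any, Dict


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when the output mentions the keyword given in the test vars."""
    # `expected_keyword` is a hypothetical test variable, not a Promptfoo built-in.
    keyword = str(context.get("vars", {}).get("expected_keyword", ""))
    # An empty keyword never passes, so a misconfigured test fails loudly.
    found = bool(keyword) and keyword.lower() in output.lower()
    return {
        "pass": found,
        "score": 1.0 if found else 0.0,
        "reason": (
            f"found {keyword!r} in output"
            if found
            else f"{keyword!r} missing from output"
        ),
    }
```

A test case would reference such a file through a `python` assertion whose `value` points at it. The returned dict uses the `pass`, `score`, and `reason` fields from the GradingResult format in the SKILL.md.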
Adoption & trust: 586 installs on skills.sh; 1.2k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You changed a system prompt or few-shot block and have no cheap, repeatable way to preview renders and assert outputs across models.
Who is it for?
Indie builders maintaining agent skills or chat features who already use or want Promptfoo as the eval harness.
Skip if: your team does not version prompts, or you need full application E2E tests unrelated to LLM I/O.
When should I use this skill?
When configuring Promptfoo providers, assertions, or zero-cost echo preview runs for prompt regression testing.
What do I get? / Deliverables
You get working Promptfoo YAML patterns, echo dry-runs, and assertion hooks so eval suites can run before merge or release.
- Provider and assertion YAML snippets
- Eval-ready prompt preview and test case patterns
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Prompt regression and provider A/B checks belong on the Ship shelf as testing work you do before trusting prompts in production. The Testing subphase fits because the skill documents eval providers, assertions, and preview runs rather than building UI or deploying infra.
Where it fits
Run the echo provider against your prompt config nightly to catch broken variable substitution before release.
Add labeled Claude and GPT providers to compare tool-calling instructions while authoring a skill.
Smoke-test a landing-page chatbot prompt matrix with assertions before committing to a full build.
Re-run the Promptfoo suite after copy changes to lifecycle emails generated from an LLM template.
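For the echo-provider use case above, one way to catch broken variable substitution is a Python assertion that inspects the echoed prompt. This is a rough sketch under several assumptions: the output is the rendered prompt itself (as with the echo provider), placeholders use the `{{ name }}` form, and every plain string variable is expected to appear verbatim in the prompt. The last assumption will not hold for every template, so treat the second check as optional.

```python
import re
from typing import Any, Dict, List

# Matches template placeholders such as {{ name }} that were left unrendered.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def get_assert(output: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Fail when the echoed prompt still contains placeholders or lacks a var value."""
    problems: List[str] = []

    leftovers = sorted(set(PLACEHOLDER.findall(output)))
    if leftovers:
        problems.append("unrendered placeholders: " + ", ".join(leftovers))

    for name, value in context.get("vars", {}).items():
        # Only plain, non-empty strings are checked; file:// references are
        # skipped because their loaded form is not known to this sketch.
        if not isinstance(value, str) or not value or value.startswith("file://"):
            continue
        if value not in output:
            problems.append(f"value of var {name!r} not found in rendered prompt")

    return {
        "pass": not problems,
        "score": 0.0 if problems else 1.0,
        "reason": "; ".join(problems) or "all variables rendered",
    }
```

Because the echo provider makes no API calls, a check like this can run on every commit or nightly at no token cost.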
How it compares
Skill reference for Promptfoo configs—not a hosted eval platform or a replacement for unit tests on non-LLM code.
Common Questions / FAQ
Who is Promptfoo Evaluation for?
Solo developers and small teams building Claude- or GPT-backed agents who want copy-paste provider and assertion recipes for Promptfoo.
When should I use Promptfoo Evaluation?
During Ship testing before release, and during Build agent-tooling when designing prompt suites, echo previews, or multi-model A/B configs.
Is Promptfoo Evaluation safe to install?
The skill is mostly documentation, but any real eval run uses your own API keys. Check the Security Audits panel on this page, and scan your configs for secrets before committing them.
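As an illustration of scanning a config before committing it, the sketch below flags strings that look like API keys. It is not a substitute for a dedicated scanner such as gitleaks. The function name `find_suspect_keys` and the `sk-` pattern are assumptions made for this example, and the pattern will miss keys in other formats.

```python
import re
from typing import List, Tuple

# Heuristic: a token starting with "sk-" followed by 20+ key-like characters.
KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}")


def find_suspect_keys(config_text: str) -> List[Tuple[int, str]]:
    """Return (line number, truncated token) pairs for key-like strings."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for match in KEY_PATTERN.finditer(line):
            # Truncate the token so the report never echoes a full secret.
            hits.append((lineno, match.group(0)[:6] + "..."))
    return hits
```

A pre-commit hook could read each Promptfoo YAML file, pass its text to this function, and block the commit when the returned list is not empty.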
SKILL.md
SKILL.md - Promptfoo Evaluation
Security scan passed
Scanned at: 2026-03-02T20:00:16.607484
Tool: gitleaks + pattern-based validation
Content hash: 058a48a82477727772269754ab2bae5bb1f575fc264a1e28f1a2cfad25656b95

# Promptfoo API Reference

## Provider Configuration

### Echo Provider (No API Calls)

```yaml
providers:
  - echo  # Returns prompt as-is, no API calls
```

**Use cases:**
- Preview rendered prompts without cost
- Debug variable substitution
- Verify few-shot structure
- Test configuration before production runs

**Cost:** Free - no tokens consumed.

### Anthropic

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 4096
      temperature: 0.7
      # For relay/proxy APIs:
      # apiBaseUrl: https://your-relay.example.com/api
```

### OpenAI

```yaml
providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.5
      max_tokens: 2048
```

### Multiple Providers (A/B Testing)

```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude
  - id: openai:gpt-4.1
    label: GPT-4.1
```

## Assertion Reference

### Python Assertion Context

```python
from typing import Any


class AssertionContext:
    prompt: str            # Raw prompt sent to LLM
    vars: dict             # Test case variables
    test: dict             # Complete test case
    config: dict           # Assertion config
    provider: Any          # Provider info
    providerResponse: Any  # Full response
```

### GradingResult Format

```python
{
    "pass": bool,            # Required: pass/fail
    "score": float,          # 0.0-1.0 score
    "reason": str,           # Explanation
    "named_scores": dict,    # Custom metrics
    "component_results": []  # Nested results
}
```

### Assertion Types

| Type | Description | Parameters |
|------|-------------|------------|
| `contains` | Substring check | `value` |
| `icontains` | Case-insensitive | `value` |
| `equals` | Exact match | `value` |
| `regex` | Pattern match | `value` |
| `not-contains` | Absence check | `value` |
| `starts-with` | Prefix check | `value` |
| `contains-any` | Any substring | `value` (array) |
| `contains-all` | All substrings | `value` (array) |
| `cost` | Token cost | `threshold` |
| `latency` | Response time | `threshold` (ms) |
| `perplexity` | Model confidence | `threshold` |
| `python` | Custom Python | `value` (file/code) |
| `javascript` | Custom JS | `value` (code) |
| `llm-rubric` | LLM grading | `value`, `threshold` |
| `factuality` | Fact checking | `value` (reference) |
| `model-graded-closedqa` | Q&A grading | `value` |
| `similar` | Semantic similarity | `value`, `threshold` |

## Test Case Configuration

### Full Test Case Structure

```yaml
- description: "Test name"
  vars:
    var1: "value"
    var2: file://path.txt
  assert:
    - type: contains
      value: "expected"
  metadata:
    category: "test-category"
    priority: high
  options:
    provider: specific-provider
    transform: "output.trim()"
```

### Loading Variables from Files

```yaml
vars:
  # Text file (loaded as string)
  content: file://data/input.txt
  # JSON/YAML (parsed to object)
  config: file://config.json
  # Python script (executed, returns value)
  dynamic: file://scripts/generate.py
  # PDF (text extracted)
  document: file://docs/report.pdf
  # Image (base64 encoded)
  image: file://images/photo.png
```

## Advanced Patterns

### Dynamic Test Generation (Python)

```python
# tests/generate.py
def get_tests():
    return [
        {
            "vars": {"input": f"test {i}"},
            "assert": [{"type": "contains", "value": str(i)}]
        }
        for i in range(10)
    ]
```

```yaml
tests: file://tests/generate.py:get_tests
```

### Scenario-based Testing

```yaml
scenarios:
  - config:
      - vars:
          language: "French"
      - vars:
          language: "Spanish"
    test