Ai Evals

Name: Ai Evals
Author: refoundai

refoundai/lenny-skills

1.7k installs
1.2k repo stars
Updated July 16, 2026

Systematic frameworks for measuring AI output quality through structured evaluation methodologies, rubrics, and test case design.

About

AI evals are systematic frameworks for measuring AI product quality, treating evaluation as a core product specification rather than optional testing. This skill guides developers through understanding what success means for their AI features, designing evaluation approaches with appropriate rubrics and test cases, and implementing measurement systems grounded in user needs. The process involves manual review to identify failure patterns, open coding for error classification, and validation of scoring criteria. Evals require discipline: binary Pass/Fail decisions over vague scales, human-validated LLM-as-judge approaches, and iterative refinement based on actual usage patterns. Mastering evals is essential for shipping AI products that reliably meet user expectations.

Design structured evaluation rubrics and test cases grounded in failure pattern analysis
Validate LLM-as-judge approaches against human expert baselines
Convert product requirements into measurable, binary success criteria
Iterate eval frameworks based on manual trace review and error clustering
Align technical metrics with actual user needs and product outcomes

Ai Evals by the numbers

1,747 all-time installs (skills.sh)
+46 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #701 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/refoundai/lenny-skills --skill ai-evals

Installs	1.7k
Repo stars	1.2k
Security audit	3 / 3 scanners passed
Last updated	July 16, 2026
Repository	refoundai/lenny-skills

What it does

Design and implement systematic AI evaluations for LLM products, measuring model quality through structured rubrics and test cases.

Who is it for?

Teams building AI features who need objective quality gates; LLM product managers validating model improvements; engineers designing AI evaluation infrastructure.

Skip if: you are evaluating non-AI software, writing simple pass/fail unit tests, or validating cosmetic UI/UX.

When should I use this skill?

Building or improving AI features; validating model changes; designing quality metrics for AI products; investigating why AI outputs fail user expectations.

What you get

Developers can confidently ship AI features with measurable quality guarantees, validated through structured evaluations grounded in user requirements.

eval rubrics
error-analysis reports
iteration backlog

By the numbers

Compiles insights from 2 guests and 2 mentions

Files

SKILL.md (Markdown)

AI Evals

Help the user create systematic evaluations for AI products using insights from AI practitioners.

How to Help

When the user asks for help with AI evals:

1. Understand what they're evaluating - Ask what AI feature or model they're testing and what "good" looks like
2. Help design the eval approach - Suggest rubrics, test cases, and measurement methods
3. Guide implementation - Help them think through edge cases, scoring criteria, and iteration cycles
4. Connect to product requirements - Ensure evals align with actual user needs, not just technical metrics

Core Principles

Evals are the new PRD

Brendan Foody: "If the model is the product, then the eval is the product requirement document." Evals define what success looks like in AI products—they're not optional quality checks, they're core specifications.

Evals are a core product skill

Hamel Husain & Shreya Shankar: "Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders." This isn't just for ML engineers—product people need to master this.

The workflow matters

Building good evals involves error analysis, open coding (writing down what's wrong), clustering failure patterns, and creating rubrics. It's a systematic process, not a one-time test.
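The clustering step in this workflow can be sketched as a simple tally. This is a minimal illustration, assuming the open-coding notes have already been written by hand and each trace has been assigned a failure category by a reviewer. The trace IDs, notes, and category names below are hypothetical.

```python
from collections import Counter

# Hypothetical open-coding notes from manual trace review.
# Each entry: (trace_id, free-text note, failure category assigned by the reviewer).
coded_traces = [
    ("t01", "Invented a refund policy that does not exist", "hallucination"),
    ("t02", "Ignored the user's request for a short answer", "instruction_not_followed"),
    ("t03", "Quoted a price that is not in the catalog", "hallucination"),
    ("t04", "Reply was curt after the user complained", "tone"),
    ("t05", "Cited a nonexistent help article", "hallucination"),
]


def rank_failure_modes(traces):
    """Count how often each failure category appears, most frequent first."""
    counts = Counter(category for _, _, category in traces)
    return counts.most_common()


ranked = rank_failure_modes(coded_traces)
for category, count in ranked:
    print(f"{category}: {count}")
```

The most frequent categories are natural candidates for the first rubric criteria, since they account for the largest share of observed failures.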

Questions to Help Users

"What does 'good' look like for this AI output?"
"What are the most common failure modes you've seen?"
"How will you know if the model got better or worse?"
"Are you measuring what users actually care about?"
"Have you manually reviewed enough outputs to understand failure patterns?"

Common Mistakes to Flag

Skipping manual review - You can't write good evals without first understanding failure patterns through manual trace analysis
Using vague criteria - "The output should be good" isn't an eval; you need specific, measurable criteria
LLM-as-judge without validation - If using an LLM to judge, you must validate that judge against human experts
Likert scales over binary - Force Pass/Fail decisions; 1-5 scales produce meaningless averages
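One way to replace a vague criterion with specific, binary ones is to write each criterion as a named check that returns True or False. The sketch below assumes a hypothetical support-reply feature; the criterion names and string checks are illustrative only, and real criteria would come from the failure patterns found in manual review.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # returns True when the output passes


# Hypothetical criteria for a support-reply feature.
CRITERIA = [
    Criterion("states_order_id", lambda out: "order #" in out.lower()),
    Criterion("under_80_words", lambda out: len(out.split()) <= 80),
    Criterion("no_refund_promise", lambda out: "guarantee a refund" not in out.lower()),
]


def evaluate(output: str) -> dict:
    """Run every criterion; the output passes only if all criteria pass."""
    results = {c.name: c.check(output) for c in CRITERIA}
    results["pass"] = all(results.values())
    return results


good = evaluate("Your order #1234 shipped yesterday and should arrive Friday.")
bad = evaluate("We guarantee a refund for every complaint.")
print(good["pass"], bad["pass"])
```

Each failing output names the criterion it failed, which an averaged score cannot do: the Likert score sets [3, 3, 3, 3] and [5, 5, 1, 1] both average 3.0 while describing very different behavior.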

Deep Dive

For both insights from the 2 guests, see references/guest-insights.md

Related Skills

Building with LLMs
AI Product Strategy
Evaluating New Technology

How it compares

Pick ai-evals when you need a structured pre-ship eval methodology rather than generic prompt-engineering tips or classic test-pyramid guidance.

FAQ

What's the difference between evals and unit tests?

Unit tests validate code logic; evals measure whether AI outputs meet product requirements. Evals depend on manual review to identify failure patterns, on judgment criteria that are partly subjective, and on iteration based on real usage. They're product specifications, not functional tests.

Should I use an LLM to judge my evaluations?

Only with human validation. If using LLM-as-judge, first validate that the judge's scoring aligns with expert human reviewers on a representative sample. Without this calibration, you're measuring the judge's biases, not output quality.
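A minimal sketch of that calibration check, assuming you already have binary Pass/Fail verdicts from human experts and from the judge on the same sample. The function name, the metric names, and the example labels are hypothetical; what counts as acceptable agreement is a decision for your team.

```python
def judge_agreement(human, judge):
    """Compare binary judge verdicts to human verdicts (True = Pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    agree = sum(h == j for h, j in pairs)
    # Judge verdicts split by what the human expert decided.
    on_human_pass = [j for h, j in pairs if h]
    on_human_fail = [j for h, j in pairs if not h]
    return {
        "agreement": agree / len(pairs),
        # Share of human Passes that the judge also passed.
        "pass_recall": (sum(on_human_pass) / len(on_human_pass)
                        if on_human_pass else None),
        # Share of human Fails that the judge also failed.
        "fail_recall": (sum(not j for j in on_human_fail) / len(on_human_fail)
                        if on_human_fail else None),
    }


# Hypothetical verdicts on the same eight outputs.
human = [True, True, True, False, False, True, False, True]
judge = [True, True, False, False, True, True, False, True]
report = judge_agreement(human, judge)
print(report)
```

Reporting the two recall figures separately matters when most outputs pass: a judge that passes everything can show high overall agreement while missing every real failure.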

How many test cases do I need?

Start with manual review of 50-100 real outputs to understand failure modes and clusters. Use that to design representative test cases. Quality of coverage matters more than quantity; aim for cases that stress failure modes you've identified.
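Drawing that first review batch can be as simple as a seeded random sample, so the same batch can be re-drawn later. This sketch assumes traces are available as a list of identifiers; the function name and trace IDs are hypothetical.

```python
import random


def sample_for_review(traces, k=50, seed=0):
    """Pick up to k traces uniformly at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    k = min(k, len(traces))  # avoid an error when fewer than k traces exist
    return rng.sample(traces, k)


# Hypothetical trace identifiers pulled from production logs.
traces = [f"trace-{i:04d}" for i in range(1000)]
batch = sample_for_review(traces, k=50)
print(len(batch))
```

A uniform sample is only a starting point; once failure modes are known, later batches can deliberately over-sample the cases that stress them.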

Is Ai Evals safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Building, llm, research, automation