
LLM Evaluation
Design automated metrics, human evals, and benchmarks so you can trust LLM features before and after production changes.
Overview
LLM Evaluation is an agent skill, used most often in the Ship phase (and also in Build), that covers automated metrics, human feedback, and benchmarking for LLM application quality.
Install
npx skills add https://github.com/wshobson/agents --skill llm-evaluation
What is this skill?
- Covers automated metrics (BLEU, ROUGE, BERTScore, perplexity, classification, and RAG retrieval scores)
- Human evaluation and A/B testing patterns for judging quality beyond cheap proxies
- Workflow for baselines, regression detection, and validating prompt or model swaps
- Debugging unexpected model behavior with structured eval loops
- Framed for production LLM applications—not one-off demo prompts
- 3 core evaluation types: automated metrics, human evaluation, and LLM-as-judge comparison
Adoption & trust: 7.7k installs on skills.sh; 36.5k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your LLM feature looks fine in chat but you lack systematic scores, baselines, or regression checks before you ship or change prompts.
Who is it for?
Indie builders shipping RAG chat, agents, or AI APIs who need reproducible quality gates without a dedicated ML platform team.
Skip if: you run a static site with no LLM components, or your team already operates a mature offline eval platform with signed-off production SLOs.
When should I use this skill?
Testing LLM performance, measuring AI application quality, comparing models or prompts, or establishing evaluation frameworks.
What do I get? / Deliverables
You define eval suites, metrics, and comparison runs so that prompt and model changes are backed by measured improvement rather than guesswork.
- Evaluation plan with metric choices per task type (generation, classification, RAG)
- Baseline scores and regression check procedure for prompt or model changes
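A regression check against stored baseline scores might look like the sketch below. The metric names, the 0.02 tolerance, and the `check_regressions` helper are illustrative assumptions rather than part of the skill, and the sketch assumes that higher scores are better for every metric.

```python
def check_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return the metrics whose candidate score fell more than `tolerance` below baseline.

    Hypothetical helper: assumes higher is better for every metric, and treats
    a metric that is missing from the candidate run as a regression.
    """
    regressions: dict[str, float] = {}
    for name, base_score in baseline.items():
        if name not in candidate:
            # A metric that was not measured cannot be shown to be safe.
            regressions[name] = float("-inf")
            continue
        delta = candidate[name] - base_score
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Made-up scores for illustration only.
baseline_scores = {"accuracy": 0.91, "groundedness": 0.88}
candidate_scores = {"accuracy": 0.92, "groundedness": 0.81}

failed = check_regressions(baseline_scores, candidate_scores)
if failed:
    print("Regressions detected:", sorted(failed))
```

A non-empty result can be used to fail a CI job before a prompt or model change is merged. The tolerance is a judgment call and should reflect how noisy each metric is on your eval set.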
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Pre-deployment measurement, regression detection, and production confidence are Ship/testing concerns even when you prototype models during Build. The skill targets systematic performance testing, A/B comparisons, and baselines—classic pre-ship quality gates for LLM apps.
Where it fits
Draft an offline eval set while building tools and prompts for your coding agent product.
Run automated and human evals to gate a model upgrade before merge to main.
Reuse baseline metrics to interpret drift signals and plan prompt fixes.
How it compares
Evaluation design and harness guidance for LLM apps—not a hosted observability product or generic unit-test skill.
Common Questions / FAQ
Who is llm-evaluation for?
Solo developers building LLM-powered features who must measure quality, compare models or prompts, and catch regressions without a full data-science org.
When should I use llm-evaluation?
During ship testing before release, when swapping models or prompts, while debugging bad outputs, and during build when you stand up agent-tooling eval harnesses.
Is llm-evaluation safe to install?
It describes measurement workflows that may touch sample data and external APIs in your own harness; review the Security Audits panel on this page and isolate eval datasets from production secrets.
SKILL.md
# LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

## When to Use This Skill

- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior

## Core Evaluation Types

### 1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.

**Text Generation:**

- **BLEU**: N-gram overlap (translation)
- **ROUGE**: Recall-oriented (summarization)
- **METEOR**: Semantic similarity
- **BERTScore**: Embedding-based similarity
- **Perplexity**: Language model confidence

**Classification:**

- **Accuracy**: Percentage correct
- **Precision/Recall/F1**: Class-specific performance
- **Confusion Matrix**: Error patterns
- **AUC-ROC**: Ranking quality

**Retrieval (RAG):**

- **MRR**: Mean Reciprocal Rank
- **NDCG**: Normalized Discounted Cumulative Gain
- **Precision@K**: Relevant in top K
- **Recall@K**: Coverage in top K

### 2. Human Evaluation

Manual assessment for quality aspects difficult to automate.

**Dimensions:**

- **Accuracy**: Factual correctness
- **Coherence**: Logical flow
- **Relevance**: Answers the question
- **Fluency**: Natural language quality
- **Safety**: No harmful content
- **Helpfulness**: Useful to the user

### 3. LLM-as-Judge

Use stronger LLMs to evaluate weaker model outputs.
**Approaches:**

- **Pointwise**: Score individual responses
- **Pairwise**: Compare two responses
- **Reference-based**: Compare to gold standard
- **Reference-free**: Judge without ground truth

## Quick Start

```python
import asyncio
from dataclasses import dataclass
from typing import Callable

import numpy as np

# calculate_accuracy, calculate_bleu, calculate_bertscore, check_groundedness,
# and your_model are placeholders that you supply. Each metric function is
# called with prediction, reference, and context keyword arguments and
# returns a numeric score.


@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)


class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        # Collect one score per test case for every metric.
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context"),
                )
                results[metric.name].append(score)

        return {
            # Mean score per metric, plus the per-case scores for inspection.
            "metrics": {k: float(np.mean(v)) for k, v in results.items()},
            "raw_scores": results,
        }


# Usage
async def main():
    suite = EvaluationSuite([
        Metric.accuracy(),
        Metric.bleu(),
        Metric.bertscore(),
        Metric.custom("groundedness", check_groundedness),
    ])

    test_cases = [
        {
            "input": "What is the capital of France?",
            "expected": "Paris",
            "context": "France is a country in Europe. Paris is its capital.",
        },
    ]

    return await suite.evaluate(model=your_model, test_cases=test_cases)


# `await` is only valid inside a coroutine, so drive the suite with asyncio.run.
results = asyncio.run(main())
```

## Detailed patterns and worked examples

Detailed pattern documentation lives in `references/details.md`. Read that file when the navigation tier above is insufficient.

# llm-evaluation - detailed patterns and worked examples

## Automate
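As one concrete instance of the retrieval metrics listed under Core Evaluation Types, here is a minimal sketch of Precision@K, Recall@K, and MRR with binary relevance. The function names and document IDs are hypothetical and chosen only for illustration, and the sketch assumes each retrieved list contains unique IDs.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents.

    Divides by k even when fewer than k documents were retrieved.
    """
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Two made-up queries: ranked results plus the set of relevant document IDs.
query_1 = (["d3", "d1", "d7", "d2"], {"d1", "d2"})
query_2 = (["d5", "d9"], {"d5"})

p_at_3 = precision_at_k(*query_1, k=3)
r_at_3 = recall_at_k(*query_1, k=3)
mrr = mean_reciprocal_rank([query_1, query_2])
```

Functions with this shape could be adapted to the keyword-argument signature that the Quick Start suite expects and registered through `Metric.custom`, provided each test case carries the retrieved and relevant document IDs.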