
Langsmith Evaluator
Stand up LangSmith evaluators, run functions, and evaluate() runs so you can score agent outputs before and after shipping changes.
Overview
LangSmith Evaluator is an agent skill most often used in Ship (also Build, Grow) that teaches evaluators, run functions, and evaluate() flows for LangSmith agent quality pipelines.
Install
npx skills add https://github.com/langchain-ai/langsmith-skills --skill langsmith-evaluatorWhat is this skill?
- Three-component map: Creating Evaluators, Defining Run Functions, Running Evaluations
- Supports LLM-as-Judge and custom code evaluators with Python and TypeScript examples
- LangSmith CLI for listing and managing evaluators with --api-key when env is unset
- Documents required LANGSMITH_API_KEY plus LANGSMITH_PROJECT, workspace ID, and OpenAI for judges
- Local evaluate() and auto-run via uploaded evaluators on LangSmith
- 3 core components: Creating Evaluators, Defining Run Functions, Running Evaluations
Adoption & trust: 2.3k installs on skills.sh; 130 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have LangSmith traces and an agent but no structured way to score outputs with judges or custom evaluators on every change.
Who is it for?
Solo builders shipping LangChain/LangSmith agents who need LLM-as-Judge or custom metrics tied to LANGSMITH_PROJECT traces.
Skip if: Teams with no LangSmith account, no API key, or apps that only need classic non-LLM unit tests without trajectory evaluation.
When should I use this skill?
INVOKE THIS SKILL when building evaluation pipelines for LangSmith.
What do I get? / Deliverables
You can create evaluators, capture trajectories in run functions, and run LangSmith evaluations locally or via uploaded auto-run evaluators with authenticated CLI and API setup.
- Evaluator definitions
- Run functions for agent outputs/trajectories
- Executed evaluation runs via evaluate() or LangSmith auto-run
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Systematic LangSmith evaluation is the canonical shelf in Ship because it validates agent quality with datasets and judges—the same loop you rerun after launch regressions. Testing subphase fits LLM-as-Judge evaluators, trajectory capture, and local or CLI-driven evaluate() runs against traced projects.
Where it fits
Wire a run function so each agent invocation logs outputs LangSmith evaluators can score during development.
Run evaluate() with LLM-as-Judge before merging a prompt or tool-change that could drift behavior.
Re-run uploaded auto-evaluators against new traces to compare weekly quality after distribution pushes more traffic.
How it compares
Skill package for LangSmith’s eval stack—not a substitute for pytest-only suites or a hosted MCP observability server you did not configure.
Common Questions / FAQ
Who is langsmith-evaluator for?
Indie developers and small teams building agents on LangSmith who want repeatable evaluation pipelines with CLI and SDK patterns in Python or TypeScript.
When should I use langsmith-evaluator?
Use it in Ship/testing before releases to score agent runs; in Build/agent-tooling while wiring capture functions; and in Grow/analytics when you revisit judge metrics after production regressions.
Is langsmith-evaluator safe to install?
The skill documents API keys and external LangSmith/OpenAI calls; review the Security Audits panel on this page and rotate keys if you paste them into agent sessions.
SKILL.md
READMESKILL.md - Langsmith Evaluator
<oneliner> Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories for evaluation; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators. Python and TypeScript examples included. </oneliner> <setup> Environment Variables ```bash LANGSMITH_API_KEY=lsv2_pt_your_api_key_here # REQUIRED LANGSMITH_PROJECT=your-project-name # Check this to know which project has traces LANGSMITH_WORKSPACE_ID=your-workspace-id # Optional: for org-scoped keys OPENAI_API_KEY=your_openai_key # For LLM as Judge ``` Authentication is REQUIRED: either set the `LANGSMITH_API_KEY` environment variable, or pass the `--api-key` flag to CLI commands (preferred): ```bash langsmith evaluator list --api-key $LANGSMITH_API_KEY ``` **IMPORTANT:** Always check the environment variables or `.env` file for `LANGSMITH_PROJECT` before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one. Python Dependencies ```bash pip install langsmith langchain-openai python-dotenv ``` CLI Tool (for uploading evaluators) ```bash curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh ``` JavaScript Dependencies ```bash npm install langsmith openai ``` </setup> <crucial_requirement> ## Golden Rule: Inspect Before You Implement **CRITICAL:** Before writing ANY evaluator or extraction logic, you MUST: 1. **Run your agent** on sample inputs and capture the actual output 2. **Inspect the output** - print it, query LangSmith traces, understand the exact structure 3. **Only then** write code that processes that output Output structures vary significantly by framework, agent type, and configuration. Never assume the shape - always verify first. Query LangSmith traces to when outputs don't contain needed data to understand how to extract from execution. </crucial_requirement> <evaluator_format> ## Offline vs Online Evaluators **Offline Evaluators** (attached to datasets): - Function signature: `(run, example)` - receives both run outputs and dataset example - Use case: Comparing agent outputs to expected values in a dataset - Upload with: `--dataset "Dataset Name"` **Online Evaluators** (attached to projects): - Function signature: `(run)` - receives only run outputs, NO example parameter - Use case: Real-time quality checks on production runs (no reference data) - Upload with: `--project "Project Name"` **CRITICAL - Return Format:** - Each evaluator returns **ONE metric only**. For multiple metrics, create multiple evaluator functions. - Do NOT return `{"metric_name": value}` or lists of metrics - this will error. **CRITICAL - Local vs Uploaded Differences:** | | Local `evaluate()` | Uploaded to LangSmith | |---|---|---| | **Column name** | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key` — it creates a duplicate column | | **Python `run` type** | `RunTree` object → `run.outputs` (attribute) | `dict` → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` | | **TypeScript `run` type** | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` | | **Python return** | `{"score": value, "comment": "..."}` |