Langsmith Evaluator

Name: Langsmith Evaluator
Author: langchain-ai

langchain-ai/langsmith-skills

Stand up LangSmith evaluators, run functions, and evaluate() runs so you can score agent outputs before and after shipping changes.

Overview

LangSmith Evaluator is an agent skill most often used in Ship (also Build, Grow) that teaches evaluators, run functions, and evaluate() flows for LangSmith agent quality pipelines.

Install

npx skills add https://github.com/langchain-ai/langsmith-skills --skill langsmith-evaluator

What is this skill?

Three-component map: Creating Evaluators, Defining Run Functions, Running Evaluations
Supports LLM-as-Judge and custom code evaluators with Python and TypeScript examples
LangSmith CLI for listing and managing evaluators with --api-key when env is unset
Documents required LANGSMITH_API_KEY plus LANGSMITH_PROJECT, workspace ID, and OpenAI for judges
Local evaluate() and auto-run via uploaded evaluators on LangSmith
3 core components: Creating Evaluators, Defining Run Functions, Running Evaluations

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 2.3k installs on skills.sh; 130 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You have LangSmith traces and an agent but no structured way to score outputs with judges or custom evaluators on every change.

Who is it for?

Solo builders shipping LangChain/LangSmith agents who need LLM-as-Judge or custom metrics tied to LANGSMITH_PROJECT traces.

Skip if: Teams with no LangSmith account, no API key, or apps that only need classic non-LLM unit tests without trajectory evaluation.

When should I use this skill?

INVOKE THIS SKILL when building evaluation pipelines for LangSmith.

What do I get? / Deliverables

You can create evaluators, capture trajectories in run functions, and run LangSmith evaluations locally or via uploaded auto-run evaluators with authenticated CLI and API setup.

Evaluator definitions
Run functions for agent outputs/trajectories
Executed evaluation runs via evaluate() or LangSmith auto-run

Recommended Skills

Microsoft Foundrymicrosoft/azure-skills

Microsoft Foundry skill guides agents through the full Azure AI Foundry lifecycle—containerizing agents, pushing to ACR,…377k installs·1.2k stars

Azure Aimicrosoft/azure-skills

azure-ai is a Prism-oriented quick reference for Microsoft Azure AI work, with the published body centered on the Azure …375k installs·1.2k stars

Azure Hosted Copilot Sdkmicrosoft/azure-skills

Azure Hosted Copilot SDK is Microsoft's entry skill for repos using @github/copilot-sdk—it detects CopilotClient usage, …346k installs·1.2k stars

Lark Eventlarksuite/cli

Lark real-time subscription skill via lark-cli event consume for building bots and streaming webhook-style agent workers…208k installs·13.7k stars

Running Claude Code Via Litellm Copilotxixu-me/skills

Running Claude Code via LiteLLM Copilot walks through pointing Claude Code at a local LiteLLM proxy that forwards Anthro…200k installs·61 stars

Setup Matt Pocock Skillsmattpocock/skills

One-time per-repo setup so Matt Pocock engineering skills share correct issue tracker, triage strings, and domain docume…180k installs·121k stars

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Systematic LangSmith evaluation is the canonical shelf in Ship because it validates agent quality with datasets and judges—the same loop you rerun after launch regressions. Testing subphase fits LLM-as-Judge evaluators, trajectory capture, and local or CLI-driven evaluate() runs against traced projects.

Also useful

BuildAgent skills & templates

Also useful

GrowAnalytics & insights

Where it fits

Example use

BuildAgent skills & templates

Wire a run function so each agent invocation logs outputs LangSmith evaluators can score during development.

Example use

ShipTesting & QA

Run evaluate() with LLM-as-Judge before merging a prompt or tool-change that could drift behavior.

Example use

GrowAnalytics & insights

Re-run uploaded auto-evaluators against new traces to compare weekly quality after distribution pushes more traffic.

How it compares

Skill package for LangSmith’s eval stack—not a substitute for pytest-only suites or a hosted MCP observability server you did not configure.

Common Questions / FAQ

Who is langsmith-evaluator for?

Indie developers and small teams building agents on LangSmith who want repeatable evaluation pipelines with CLI and SDK patterns in Python or TypeScript.

When should I use langsmith-evaluator?

Use it in Ship/testing before releases to score agent runs; in Build/agent-tooling while wiring capture functions; and in Grow/analytics when you revisit judge metrics after production regressions.

Is langsmith-evaluator safe to install?

The skill documents API keys and external LangSmith/OpenAI calls; review the Security Audits panel on this page and rotate keys if you paste them into agent sessions.

SKILL.md

READMESKILL.md - Langsmith Evaluator

<oneliner>
Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories for evaluation; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators. Python and TypeScript examples included.
</oneliner>

<setup>
Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # REQUIRED
LANGSMITH_PROJECT=your-project-name                   # Check this to know which project has traces
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge
```

Authentication is REQUIRED: either set the `LANGSMITH_API_KEY` environment variable, or pass the `--api-key` flag to CLI commands (preferred):
```bash
langsmith evaluator list --api-key $LANGSMITH_API_KEY
```

**IMPORTANT:** Always check the environment variables or `.env` file for `LANGSMITH_PROJECT` before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one.

Python Dependencies
```bash
pip install langsmith langchain-openai python-dotenv
```

CLI Tool (for uploading evaluators)
```bash
curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh
```

JavaScript Dependencies
```bash
npm install langsmith openai
```
</setup>

<crucial_requirement>
## Golden Rule: Inspect Before You Implement

**CRITICAL:** Before writing ANY evaluator or extraction logic, you MUST:
1. **Run your agent** on sample inputs and capture the actual output
2. **Inspect the output** - print it, query LangSmith traces, understand the exact structure
3. **Only then** write code that processes that output

Output structures vary significantly by framework, agent type, and configuration. Never assume the shape - always verify first. Query LangSmith traces to when outputs don't contain needed data to understand how to extract from execution.
</crucial_requirement>

<evaluator_format>
## Offline vs Online Evaluators

**Offline Evaluators** (attached to datasets):
- Function signature: `(run, example)` - receives both run outputs and dataset example
- Use case: Comparing agent outputs to expected values in a dataset
- Upload with: `--dataset "Dataset Name"`

**Online Evaluators** (attached to projects):
- Function signature: `(run)` - receives only run outputs, NO example parameter
- Use case: Real-time quality checks on production runs (no reference data)
- Upload with: `--project "Project Name"`

**CRITICAL - Return Format:**
- Each evaluator returns **ONE metric only**. For multiple metrics, create multiple evaluator functions.
- Do NOT return `{"metric_name": value}` or lists of metrics - this will error.

**CRITICAL - Local vs Uploaded Differences:**

| | Local `evaluate()` | Uploaded to LangSmith |
|---|---|---|
| **Column name** | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key` — it creates a duplicate column |
| **Python `run` type** | `RunTree` object → `run.outputs` (attribute) | `dict` → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` |
| **TypeScript `run` type** | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` |
| **Python return** | `{"score": value, "comment": "..."}` |

What is this skill?

Three-component map: Creating Evaluators, Defining Run Functions, Running Evaluations

Supports LLM-as-Judge and custom code evaluators with Python and TypeScript examples

LangSmith CLI for listing and managing evaluators with --api-key when env is unset

Documents required LANGSMITH_API_KEY plus LANGSMITH_PROJECT, workspace ID, and OpenAI for judges

Local evaluate() and auto-run via uploaded evaluators on LangSmith

3 core components: Creating Evaluators, Defining Run Functions, Running Evaluations

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 2.3k installs on skills.sh; 130 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What do I get? / Deliverables

You can create evaluators, capture trajectories in run functions, and run LangSmith evaluations locally or via uploaded auto-run evaluators with authenticated CLI and API setup.

Evaluator definitions

Run functions for agent outputs/trajectories

Executed evaluation runs via evaluate() or LangSmith auto-run

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Also useful

BuildAgent skills & templates

Also useful

GrowAnalytics & insights

Where it fits

Example use

BuildAgent skills & templates

Wire a run function so each agent invocation logs outputs LangSmith evaluators can score during development.

Example use

ShipTesting & QA

Run evaluate() with LLM-as-Judge before merging a prompt or tool-change that could drift behavior.

Example use

GrowAnalytics & insights

Re-run uploaded auto-evaluators against new traces to compare weekly quality after distribution pushes more traffic.

SKILL.md

READMESKILL.md - Langsmith Evaluator

<oneliner>
Three core components: **(1) Creating Evaluators** - LLM-as-Judge, custom code; **(2) Defining Run Functions** - capture agent outputs/trajectories for evaluation; **(3) Running Evaluations** - locally with `evaluate()` or auto-run via uploaded evaluators. Python and TypeScript examples included.
</oneliner>

<setup>
Environment Variables

```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # REQUIRED
LANGSMITH_PROJECT=your-project-name                   # Check this to know which project has traces
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge
```

Authentication is REQUIRED: either set the `LANGSMITH_API_KEY` environment variable, or pass the `--api-key` flag to CLI commands (preferred):
```bash
langsmith evaluator list --api-key $LANGSMITH_API_KEY
```

**IMPORTANT:** Always check the environment variables or `.env` file for `LANGSMITH_PROJECT` before querying or interacting with LangSmith. This tells you which project contains the relevant traces and data. If the LangSmith project is not available, use your best judgement to identify the right one.

Python Dependencies
```bash
pip install langsmith langchain-openai python-dotenv
```

CLI Tool (for uploading evaluators)
```bash
curl -sSL https://raw.githubusercontent.com/langchain-ai/langsmith-cli/main/scripts/install.sh | sh
```

JavaScript Dependencies
```bash
npm install langsmith openai
```
</setup>

<crucial_requirement>
## Golden Rule: Inspect Before You Implement

**CRITICAL:** Before writing ANY evaluator or extraction logic, you MUST:
1. **Run your agent** on sample inputs and capture the actual output
2. **Inspect the output** - print it, query LangSmith traces, understand the exact structure
3. **Only then** write code that processes that output

Output structures vary significantly by framework, agent type, and configuration. Never assume the shape - always verify first. Query LangSmith traces to when outputs don't contain needed data to understand how to extract from execution.
</crucial_requirement>

<evaluator_format>
## Offline vs Online Evaluators

**Offline Evaluators** (attached to datasets):
- Function signature: `(run, example)` - receives both run outputs and dataset example
- Use case: Comparing agent outputs to expected values in a dataset
- Upload with: `--dataset "Dataset Name"`

**Online Evaluators** (attached to projects):
- Function signature: `(run)` - receives only run outputs, NO example parameter
- Use case: Real-time quality checks on production runs (no reference data)
- Upload with: `--project "Project Name"`

**CRITICAL - Return Format:**
- Each evaluator returns **ONE metric only**. For multiple metrics, create multiple evaluator functions.
- Do NOT return `{"metric_name": value}` or lists of metrics - this will error.

**CRITICAL - Local vs Uploaded Differences:**

| | Local `evaluate()` | Uploaded to LangSmith |
|---|---|---|
| **Column name** | Python: auto-derived from function name. TypeScript: must include `key` field or column is untitled | Comes from evaluator name set at upload time. Do NOT include `key` — it creates a duplicate column |
| **Python `run` type** | `RunTree` object → `run.outputs` (attribute) | `dict` → `run["outputs"]` (subscript). Handle both: `run.outputs if hasattr(run, "outputs") else run.get("outputs", {})` |
| **TypeScript `run` type** | Always attribute access: `run.outputs?.field` | Always attribute access: `run.outputs?.field` |
| **Python return** | `{"score": value, "comment": "..."}` |

Overview

Install

What is this skill?

What problem does it solve?

Who is it for?

When should I use this skill?

What do I get? / Deliverables

Recommended Skills

Journey fit

Where it fits

Who is langsmith-evaluator for?

When should I use langsmith-evaluator?

Is langsmith-evaluator safe to install?

SKILL.md

This week for builders

Overview

Install

What is this skill?

What problem does it solve?

Who is it for?

When should I use this skill?

What do I get? / Deliverables

Recommended Skills

Journey fit

Where it fits

Who is langsmith-evaluator for?

When should I use langsmith-evaluator?

Is langsmith-evaluator safe to install?

SKILL.md