
LangSmith Observability
Wire LangSmith evaluate() flows, custom evaluators, and LLM-as-judge scoring so solo builders can trace and grade agent outputs before and after production.
Overview
LangSmith Observability is an agent skill most often used in Build (also Ship, Operate) that teaches custom evaluators, batch evaluate(), and LLM-as-judge patterns for grading LangSmith traces and datasets.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill langsmith-observability
What is this skill?
- Python patterns for custom evaluators returning keyed scores and comments on LangSmith runs
- LLM-as-judge evaluator template using chat completions to rate answers on a 1–5 scale normalized to 0–1
- Uses langsmith.evaluate() with named datasets and evaluator lists for repeatable experiment batches
- Compares model predictions to reference outputs for accuracy-style gates
- Fits advanced observability beyond basic trace logging
- LLM-as-judge ratings on a 1–5 scale are divided by 5 to give a score on the 0–1 scale
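The judge normalization listed above can be sketched as a small pure function. This is a minimal illustration, not code from the skill: `normalize_judge_score` and its `fallback` parameter are hypothetical names, and it assumes the judge replies with a short string containing the rating. Because it divides by the top of the scale, a rating of 1 maps to 0.2 rather than 0.

```python
import re


def normalize_judge_score(raw_reply, lo=1, hi=5, fallback=0.5):
    """Map a judge reply such as '4' or 'Score: 4' onto the 0-1 score scale.

    Hypothetical helper: takes the first integer in the reply, clamps it to
    [lo, hi], and divides by hi. Returns `fallback` when no integer is found.
    """
    match = re.search(r"\d+", raw_reply or "")
    if match is None:
        return fallback
    rating = min(max(int(match.group()), lo), hi)
    return rating / hi
```

Clamping keeps an out-of-range reply (for example "9") from producing a score above 1.0, and the fallback keeps a single malformed reply from failing a whole batch.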
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You can log LLM runs but lack a repeatable way to score them with custom code or judges tied to LangSmith datasets.
Who is it for?
Builders running LangSmith who want dataset-driven evals and automated judges on agent or RAG answers.
Skip if: your team has no LangSmith project, or only needs console logging without structured evaluation.
When should I use this skill?
Implementing LangSmith evaluation, custom evaluators, or LLM-as-judge grading on agent/LLM runs and datasets.
What do I get? / Deliverables
You implement evaluator functions and evaluate() batches that attach accuracy and quality scores to runs so regressions are visible before deploy and in ongoing monitoring.
- Custom evaluator functions wired to evaluate()
- Documented judge prompt and scoring normalization
- Batch evaluation results with keyed metrics
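One way to turn keyed metrics into a pre-deploy check is to average each metric and compare it to a minimum. The sketch below assumes the results have already been collected as plain dicts shaped like the evaluator return values (`key`, `score`). `mean_scores_by_key`, `passes_gate`, and the threshold values are hypothetical and not part of LangSmith.

```python
from collections import defaultdict
from statistics import mean


def mean_scores_by_key(results):
    """Average the scores for each metric key, skipping missing scores."""
    grouped = defaultdict(list)
    for result in results:
        if result.get("score") is not None:
            grouped[result["key"]].append(float(result["score"]))
    return {key: mean(scores) for key, scores in grouped.items()}


def passes_gate(results, thresholds):
    """Return (ok, failures); failures maps each failing key to its mean.

    A key with no recorded scores counts as a failure with a mean of None.
    """
    means = mean_scores_by_key(results)
    failures = {}
    for key, minimum in thresholds.items():
        observed = means.get(key)
        if observed is None or observed < minimum:
            failures[key] = observed
    return (not failures), failures


# Example with made-up scores
sample = [
    {"key": "accuracy", "score": 1.0},
    {"key": "accuracy", "score": 0.0},
    {"key": "llm_judge", "score": 0.8},
]
ok, failures = passes_gate(sample, {"accuracy": 0.9, "llm_judge": 0.7})
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when an evaluator did not run would hide exactly the regressions it is meant to catch.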
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Evaluation hooks are defined while building LLM features and agent pipelines, which is when datasets, evaluators, and judge prompts are first integrated. LangSmith sits in the agent tooling layer—runs, examples, and evaluators are part of how you instrument models and chains you are actively developing.
Where it fits
Add an accuracy evaluator to a new RAG chain while iterating on prompts in LangSmith.
Run evaluate() on a frozen test dataset before tagging a release candidate.
Reuse LLM-as-judge evaluators on sampled production runs to catch quality drift.
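For the quality-drift case, a simple comparison of mean judge scores between a baseline batch and a recent production sample may be enough to start with. The sketch below is an assumption-laden illustration: `detect_drift` and the `max_drop` tolerance are hypothetical, and both inputs are plain lists of 0–1 scores that you have already collected.

```python
from statistics import mean


def detect_drift(baseline_scores, recent_scores, max_drop=0.1):
    """Flag drift when the recent mean falls more than max_drop below baseline.

    Returns (drifted, drop), where drop = baseline mean - recent mean.
    """
    if not baseline_scores or not recent_scores:
        raise ValueError("both score lists must be non-empty")
    drop = mean(baseline_scores) - mean(recent_scores)
    return drop > max_drop, drop
```

A mean comparison ignores variance and sample size, so small production samples can trigger false alarms. Treat the flag as a prompt to inspect traces rather than as a verdict.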
How it compares
This is a skill package for LangSmith evaluate() and LLM judges, not a hosted MCP server or a generic unit-test runner.
Common Questions / FAQ
Who is langsmith-observability for?
Solo and indie developers building LLM or agent features who already use LangSmith and need structured evaluation beyond raw traces.
When should I use langsmith-observability?
In Build while attaching evaluators to new chains; in Ship when running dataset regressions before release; in Operate when monitoring answer quality trends with LLM-as-judge scoring.
Is langsmith-observability safe to install?
The skill includes API usage patterns—confirm keys and data handling in your environment and read the Security Audits panel on this page before running judges on production data.
SKILL.md
# LangSmith Advanced Usage Guide

## Custom Evaluators

### Simple Custom Evaluator

```python
from langsmith import evaluate

def accuracy_evaluator(run, example):
    """Check if prediction matches reference."""
    prediction = run.outputs.get("answer", "")
    reference = example.outputs.get("answer", "")

    # Case- and whitespace-insensitive exact match
    score = 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

    return {
        "key": "accuracy",
        "score": score,
        "comment": f"Predicted: {prediction[:50]}..."
    }

# my_model is the target callable under test (defined elsewhere)
results = evaluate(
    my_model,
    data="test-dataset",
    evaluators=[accuracy_evaluator]
)
```

### LLM-as-Judge Evaluator

```python
from langsmith import evaluate
from openai import OpenAI

client = OpenAI()

def llm_judge_evaluator(run, example):
    """Use LLM to evaluate response quality."""
    prediction = run.outputs.get("answer", "")
    question = example.inputs.get("question", "")
    reference = example.outputs.get("answer", "")

    prompt = f"""Evaluate the following response for accuracy and helpfulness.

Question: {question}
Reference Answer: {reference}
Model Response: {prediction}

Rate on a scale of 1-5:
1 = Completely wrong
5 = Perfect answer

Respond with just the number."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10
    )

    # message.content can be None, so fall back to an empty string
    content = response.choices[0].message.content or ""

    try:
        # Divide the 1-5 rating by 5 to get a score on the 0-1 scale
        score = int(content.strip()) / 5.0
    except ValueError:
        # Unparseable reply: use a neutral score
        score = 0.5

    return {
        "key": "llm_judge",
        "score": score,
        "comment": content
    }

results = evaluate(
    my_model,
    data="test-dataset",
    evaluators=[llm_judge_evaluator]
)
```

### Async Evaluator

```python
from langsmith import aevaluate
import asyncio

async def async_evaluator(run, example):
    """Async evaluator for concurrent evaluation."""
    prediction = run.outputs.get("answer", "")

    # Async operation (e.g., API call)
    # compute_similarity_async is a placeholder defined elsewhere
    score = await compute_similarity_async(prediction, example.outputs["answer"])

    return {"key": "similarity", "score": score}

async def run_async_eval():
    # async_model is the async target under test (defined elsewhere)
    results = await aevaluate(
        async_model,
        data="test-dataset",
        evaluators=[async_evaluator],
        max_concurrency=10
    )
    return results

results = asyncio.run(run_async_eval())
```

### Multiple Return Values

```python
def comprehensive_evaluator(run, example):
    """Return multiple evaluation results."""
    prediction = run.outputs.get("answer", "")
    reference = example.outputs.get("answer", "")

    return [
        {"key": "exact_match", "score": 1.0 if prediction == reference else 0.0},
        {"key": "length_ratio", "score": min(len(prediction) / max(len(reference), 1), 1.0)},
        {"key": "contains_reference", "score": 1.0 if reference.lower() in prediction.lower() else 0.0}
    ]
```

## Summary Evaluators

```python
def summary_evaluator(runs, examples):
    """Compute aggregate metrics across all runs."""
    # Only runs with both timestamps contribute to the average
    latencies = [
        (run.end_time - run.start_time).total_seconds()
        for run in runs
        if run.end_time and run.start_time
    ]
    avg_latency = sum(latencies) / len(latencies) if latencies else 0

    return {
        "key": "avg_latency",
        "score": avg_latency
    }

results = evaluate(
    my_model,
    data="test-dataset",
    evaluators=[accuracy_evaluator],
    summary_evaluators=[summary_evaluator]
)
```

## Comparative Evaluation

```python
# evaluate_comparative is imported from the langsmith.evaluation module
from langsmith.evaluation import evaluate_comparative

def pairwise_judge(runs, example):
    """Compare two model outputs."""
    output_a = runs[0].outputs.get("answer", "")
    output_b = runs[1].outputs.get("answer", "")
    reference = example.outputs.get("answer", "")

    # Use LLM to compare
    prompt = f"""Compare these two answers to the question.

Question: {example.inputs['question']}
Reference: {reference}

Answer A: {output_a}
Answer B: {output_b}

Which is better? Respond with 'A', 'B'