Phoenix Evals

Name: Phoenix Evals
Author: arize-ai

arize-ai/phoenix

1k installs
10.8k repo stars
Updated July 28, 2026
arize-ai/phoenix

phoenix-evals is an Arize Phoenix skill that applies axial coding to group open-ended agent failure notes into structured, quantifiable taxonomies for developers who need systematic LLM evaluation feedback loops.

About

phoenix-evals is an evaluation workflow skill from arize-ai/phoenix that implements axial coding to transform unstructured agent failure notes into actionable, countable categories. The four-step process gathers open coding notes, groups them by shared themes, names actionable categories, and quantifies failure counts per category with YAML taxonomies covering content quality, communication, and context dimensions. Developers building LLM agents reach for phoenix-evals after collecting qualitative eval notes and needing structured failure taxonomies—such as hallucination, tone mismatch, or ignored user context—to drive prioritized prompt, retrieval, and tooling fixes in Phoenix eval workflows.

4-step Axial Coding process: Gather, Pattern, Name, Quantify
Creates hierarchical failure_taxonomy in clean YAML
Supports human and programmatic span annotations
Covers content_quality, communication, context, and safety categories
Works with both Python and TypeScript Phoenix clients

Phoenix Evals by the numbers

1,014 all-time installs (skills.sh)
+83 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #990 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/arize-ai/phoenix --skill phoenix-evals

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/arize-ai/phoenix/phoenix-evals.svg)](https://skillselion.com/skills/arize-ai/phoenix/phoenix-evals)

Installs	1k
repo stars	★ 10.8k
Security audit	3 / 3 scanners passed
Last updated	July 28, 2026
Repository	arize-ai/phoenix ↗

How do you taxonomy agent failure notes from evals?

Systematically group open-ended notes about agent failures into structured, quantifiable taxonomies.

Who is it for?

ML and agent engineers running Phoenix evals who have qualitative failure notes and need quantified category groupings.

Skip if: Running live production traces, automated pass-fail unit tests, or teams without collected open coding notes to analyze.

When should I use this skill?

A developer has agent eval failure notes and asks to group, name, and quantify failure patterns into a structured taxonomy.

What you get

Structured failure taxonomy YAML, named axial categories, and per-category failure counts for prioritization.

failure taxonomy YAML
category failure counts
named axial code groups

By the numbers

Uses a 4-step axial coding process: gather, pattern, name, quantify

Files

references/

SKILL.mdMarkdownGitHub ↗

Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

Quick Reference

Task	Files
Setup	setup-python, setup-typescript
Decide what to evaluate	evaluators-overview
Choose a judge model	fundamentals-model-selection
Use pre-built evaluators	evaluators-pre-built
Build code evaluator	evaluators-code-python, evaluators-code-typescript
Build LLM evaluator	evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates
Batch evaluate DataFrame	evaluate-dataframe-python
Run experiment	experiments-running-python, experiments-running-typescript
Create dataset	experiments-datasets-python, experiments-datasets-typescript
Generate synthetic data	experiments-synthetic-python, experiments-synthetic-typescript
Validate evaluator accuracy	validation, validation-evaluators-python, validation-evaluators-typescript
Sample traces for review	observe-sampling-python, observe-sampling-typescript
Analyze errors	error-analysis, error-analysis-multi-turn, axial-coding
RAG evals	evaluators-rag
Avoid common mistakes	common-mistakes-python, fundamentals-anti-patterns
Production	production-overview, production-guardrails, production-continuous

Workflows

Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview

Building Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

RAG Systems: evaluators-rag → evaluators-code- (retrieval) → evaluators-llm- (faithfulness)

Production: production-overview → production-guardrails → production-continuous

Reference Categories

Prefix	Description
`fundamentals-*`	Types, scores, anti-patterns
`observe-*`	Tracing, sampling
`error-analysis-*`	Finding failures
`axial-coding-*`	Categorizing failures
`evaluators-*`	Code, LLM, RAG evaluators
`experiments-*`	Datasets, running experiments
`validation-*`	Validating evaluator accuracy against human labels
`production-*`	CI/CD, monitoring

Key Principles

Principle	Action
Error analysis first	Can't automate what you haven't observed
Custom > generic	Build from your failures
Code first	Deterministic before LLM
Validate judges	>80% TPR/TNR
Binary > Likert	Pass/fail, not 1-5

Axial Coding

Group open-ended notes into structured failure taxonomies.

Process

1. Gather - Collect open coding notes 2. Pattern - Group notes with common themes 3. Name - Create actionable category names 4. Quantify - Count failures per category

Example Taxonomy

failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]
  
  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]
  
  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]
  
  safety:
    missing_disclaimers: [legal, medical, financial]

Add Annotation (Python)

from phoenix.client import Client

client = Client()
client.spans.add_span_annotation(
    span_id="abc123",
    annotation_name="failure_category",
    label="hallucination",
    explanation="invented a feature that doesn't exist",
    annotator_kind="HUMAN",
    sync=True,
)

Add Annotation (TypeScript)

import { addSpanAnnotation } from "@arizeai/phoenix-client/spans";

await addSpanAnnotation({
  spanAnnotation: {
    spanId: "abc123",
    name: "failure_category",
    label: "hallucination",
    explanation: "invented a feature that doesn't exist",
    annotatorKind: "HUMAN",
  }
});

Agent Failure Taxonomy

agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]

Transition Matrix (Agents)

Shows where failures occur between states:

def build_transition_matrix(conversations, states):
    matrix = defaultdict(lambda: defaultdict(int))
    for conv in conversations:
        if conv["failed"]:
            last_success = find_last_success(conv)
            first_failure = find_first_failure(conv)
            matrix[last_success][first_failure] += 1
    return pd.DataFrame(matrix).fillna(0)

Principles

MECE - Each failure fits ONE category
Actionable - Categories suggest fixes
Bottom-up - Let categories emerge from data

Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

Legacy Model Classes

# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")

Why: OpenAIModel, AnthropicModel, etc. are legacy 1.0 wrappers in phoenix.evals.legacy. The LLM class is provider-agnostic and is the current 2.0 API.

Using run_evals Instead of evaluate_dataframe

# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames

# RIGHT — current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns

Why: run_evals is the legacy 1.0 batch function. evaluate_dataframe is the current 2.0 function with a different return format.

Wrong Result Column Names

# WRONG — column doesn't exist
score = results_df["relevance"].mean()

# WRONG — column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT — extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()

Why: evaluate_dataframe returns columns named {name}_score containing Score dicts like {"name": "...", "score": 1.0, "label": "...", "explanation": "..."}.

Deprecated project_name Parameter

# WRONG
df = client.spans.get_spans_dataframe(project_name="my-project")

# RIGHT
df = client.spans.get_spans_dataframe(project_identifier="my-project")

Why: project_name is deprecated in favor of project_identifier, which also accepts project IDs.

Wrong Client Constructor

# WRONG
client = Client(endpoint="https://app.phoenix.arize.com")
client = Client(url="https://app.phoenix.arize.com")

# RIGHT — for remote/cloud Phoenix
client = Client(base_url="https://app.phoenix.arize.com", api_key="...")

# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)
client = Client()

Why: The parameter is base_url, not endpoint or url. For local instances, Client() with no args works fine. For remote instances, base_url and api_key are required.

Too-Aggressive Time Filters

# WRONG — often returns zero spans
from datetime import datetime, timedelta
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    start_time=datetime.now() - timedelta(hours=1),
)

# RIGHT — use limit to control result size instead
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    limit=50,
)

Why: Traces may be from any time period. A 1-hour window frequently returns nothing. Use limit= to control result size instead.

Not Filtering Spans Appropriately

# WRONG — fetches all spans including internal LLM calls, retrievers, etc.
df = client.spans.get_spans_dataframe(project_identifier="my-project")

# RIGHT for end-to-end evaluation — filter to top-level spans
df = client.spans.get_spans_dataframe(
    project_identifier="my-project",
    root_spans_only=True,
)

# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics
all_spans = client.spans.get_spans_dataframe(
    project_identifier="my-project",
)
retriever_spans = all_spans[all_spans["span_kind"] == "RETRIEVER"]
llm_spans = all_spans[all_spans["span_kind"] == "LLM"]

Why: For end-to-end evaluation (e.g., overall answer quality), use root_spans_only=True. For RAG systems, you often need child spans separately — retriever spans for DocumentRelevance and LLM spans for Faithfulness. Choose the right span level for your evaluation target.

Assuming Span Output is Plain Text

# WRONG — output may be JSON, not plain text
df["output"] = df["attributes.output.value"]

# RIGHT — parse JSON and extract the answer field
import json

def extract_answer(output_value):
    if not isinstance(output_value, str):
        return str(output_value) if output_value is not None else ""
    try:
        parsed = json.loads(output_value)
        if isinstance(parsed, dict):
            for key in ("answer", "result", "output", "response"):
                if key in parsed:
                    return str(parsed[key])
    except (json.JSONDecodeError, TypeError):
        pass
    return output_value

df["output"] = df["attributes.output.value"].apply(extract_answer)

Why: LangChain and other frameworks often output structured JSON from root spans, like {"context": "...", "question": "...", "answer": "..."}. Evaluators need the actual answer text, not the raw JSON.

Using @create_evaluator for LLM-Based Evaluation

# WRONG — @create_evaluator doesn't call an LLM
@create_evaluator(name="relevance", kind="llm")
def relevance(input: str, output: str) -> str:
    pass  # No LLM is involved

# RIGHT — use ClassificationEvaluator for LLM-based evaluation
from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

Why: @create_evaluator wraps a plain Python function. Setting kind="llm" marks it as LLM-based but you must implement the LLM call yourself. For LLM-based evaluation, prefer ClassificationEvaluator which handles the LLM call, structured output parsing, and explanations automatically.

Using llm_classify Instead of ClassificationEvaluator

# WRONG — legacy 1.0 API
from phoenix.evals import llm_classify
results = llm_classify(
    dataframe=df,
    template=template_str,
    model=model,
    rails=["relevant", "irrelevant"],
)

# RIGHT — current 2.0 API
from phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM

classifier = ClassificationEvaluator(
    name="relevance",
    prompt_template=template_str,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])

Why: llm_classify is the legacy 1.0 function. The current pattern is to create an evaluator with ClassificationEvaluator and run it with async_evaluate_dataframe().

Using HallucinationEvaluator

# WRONG — deprecated
from phoenix.evals import HallucinationEvaluator
eval = HallucinationEvaluator(model)

# RIGHT — use FaithfulnessEvaluator
from phoenix.evals.metrics import FaithfulnessEvaluator
from phoenix.evals import LLM
eval = FaithfulnessEvaluator(llm=LLM(provider="openai", model="gpt-4o"))

Why: HallucinationEvaluator is deprecated. FaithfulnessEvaluator is its replacement, using "faithful"/"unfaithful" labels with maximized score (1.0 = faithful).

Error Analysis: Multi-Turn Conversations

Debugging complex multi-turn conversation traces.

The Approach

1. End-to-end first - Did the conversation achieve the goal? 2. Find first failure - Trace backwards to root cause 3. Simplify - Try single-turn before multi-turn debug 4. N-1 testing - Isolate turn-specific vs capability issues

Find First Upstream Failure

Turn 1: User asks about flights ✓
Turn 2: Assistant asks for dates ✓
Turn 3: User provides dates ✓
Turn 4: Assistant searches WRONG dates ← FIRST FAILURE
Turn 5: Shows wrong flights (consequence)
Turn 6: User frustrated (consequence)

Focus on Turn 4, not Turn 6.

Simplify First

Before debugging multi-turn, test single-turn:

# If single-turn also fails → problem is retrieval/knowledge
# If single-turn passes → problem is conversation context
response = chat("What's the return policy for electronics?")

N-1 Testing

Give turns 1 to N-1 as context, test turn N:

context = conversation[:n-1]
response = chat_with_context(context, user_message_n)
# Compare to actual turn N

This isolates whether error is from context or underlying capability.

Checklist

1. Did conversation achieve goal? (E2E) 2. Which turn first went wrong? 3. Can you reproduce with single-turn? 4. Is error from context or capability? (N-1 test)

Error Analysis

Review traces to discover failure modes before building evaluators.

Process

1. Sample - 100+ traces (errors, negative feedback, random) 2. Open Code - Write free-form notes per trace 3. Axial Code - Group notes into failure categories 4. Quantify - Count failures per category 5. Prioritize - Rank by frequency × severity

Sample Traces

Span-level sampling (Python — DataFrame)

from phoenix.client import Client

# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")

# Build representative sample
sample = pd.concat([
    spans_df[spans_df["status_code"] == "ERROR"].sample(30),
    spans_df[spans_df["feedback"] == "negative"].sample(30),
    spans_df.sample(40),
]).drop_duplicates("span_id").head(100)

Span-level sampling (TypeScript)

import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans: errors } = await getSpans({
  project: { projectName: "my-app" },
  statusCode: "ERROR",
  limit: 30,
});
const { spans: allSpans } = await getSpans({
  project: { projectName: "my-app" },
  limit: 70,
});
const sample = [...errors, ...allSpans.sort(() => Math.random() - 0.5).slice(0, 40)];
const unique = [...new Map(sample.map((s) => [s.context.span_id, s])).values()].slice(0, 100);

Trace-level sampling (Python)

When errors span multiple spans (e.g., agent workflows), sample whole traces:

from datetime import datetime, timedelta

traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=24),
    include_spans=True,
    sort="latency_ms",
    order="desc",
    limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans

Trace-level sampling (TypeScript)

import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

Add Notes (Python)

client.spans.add_span_note(
    span_id="abc123",
    note="wrong timezone - said 3pm EST but user is PST"
)

Add Notes (TypeScript)

import { addSpanNote } from "@arizeai/phoenix-client/spans";

await addSpanNote({
  spanNote: {
    spanId: "abc123",
    note: "wrong timezone - said 3pm EST but user is PST"
  }
});

What to Note

Type	Examples
Factual errors	Wrong dates, prices, made-up features
Missing info	Didn't answer question, omitted details
Tone issues	Too casual/formal for context
Tool issues	Wrong tool, wrong parameters
Retrieval	Wrong docs, missing relevant docs

Good Notes

BAD:  "Response is bad"
GOOD: "Response says ships in 2 days but policy is 5-7 days"

Group into Categories

categories = {
    "factual_inaccuracy": ["wrong shipping time", "incorrect price"],
    "hallucination": ["made up a discount", "invented feature"],
    "tone_mismatch": ["informal for enterprise client"],
}
# Priority = Frequency × Severity

Retrieve Existing Annotations

Python

# From a spans DataFrame
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=sample,
    project_identifier="my-app",
    include_annotation_names=["quality", "correctness"],
)
# annotations_df has: span_id (index), name, label, score, explanation

# Or from specific span IDs
annotations_df = client.spans.get_span_annotations_dataframe(
    span_ids=["span-id-1", "span-id-2"],
    project_identifier="my-app",
)

TypeScript

import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";

const { annotations } = await getSpanAnnotations({
  project: { projectName: "my-app" },
  spanIds: ["span-id-1", "span-id-2"],
  includeAnnotationNames: ["quality", "correctness"],
});

for (const ann of annotations) {
  console.log(`${ann.span_id}: ${ann.name} = ${ann.result?.label} (${ann.result?.score})`);
}

Saturation

Stop when new traces reveal no new failure modes. Minimum: 100 traces.

Batch Evaluation with evaluate_dataframe (Python)

Run evaluators across a DataFrame. The core 2.0 batch evaluation API.

Preferred: async_evaluate_dataframe

For batch evaluations (especially with LLM evaluators), prefer the async version for better throughput:

from phoenix.evals import async_evaluate_dataframe

results_df = await async_evaluate_dataframe(
    dataframe=df,              # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2], # List of evaluators
    concurrency=5,             # Max concurrent LLM calls (default 3)
    exit_on_error=False,       # Optional: stop on first error (default True)
    max_retries=3,             # Optional: retry failed LLM calls (default 10)
)

Sync Version

from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=df,              # pandas DataFrame with columns matching evaluator params
    evaluators=[eval1, eval2], # List of evaluators
    exit_on_error=False,       # Optional: stop on first error (default True)
    max_retries=3,             # Optional: retry failed LLM calls (default 10)
)

Result Column Format

async_evaluate_dataframe / evaluate_dataframe returns a copy of the input DataFrame with added columns. Result columns contain dicts, NOT raw numbers.

For each evaluator named "foo", two columns are added:

Column	Type	Contents
`foo_score`	`dict`	`{"name": "foo", "score": 1.0, "label": "True", "explanation": "...", "metadata": {...}, "kind": "code", "direction": "maximize"}`
`foo_execution_details`	`dict`	`{"status": "success", "exceptions": [], "execution_seconds": 0.001}`

Only non-None fields appear in the score dict.

Extracting Numeric Scores

# WRONG — these will fail or produce unexpected results
score = results_df["relevance"].mean()                    # KeyError!
score = results_df["relevance_score"].mean()              # Tries to average dicts!

# RIGHT — extract the numeric score from each dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
mean_score = scores.mean()

Extracting Labels

labels = results_df["relevance_score"].apply(
    lambda x: x.get("label", "") if isinstance(x, dict) else ""
)

Extracting Explanations (LLM evaluators)

explanations = results_df["relevance_score"].apply(
    lambda x: x.get("explanation", "") if isinstance(x, dict) else ""
)

Finding Failures

scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
failed_mask = scores < 0.5
failures = results_df[failed_mask]

Input Mapping

Evaluators receive each row as a dict. Column names must match the evaluator's expected parameter names. If they don't match, use .bind() or bind_evaluator:

from phoenix.evals import bind_evaluator, create_evaluator, async_evaluate_dataframe

@create_evaluator(name="check", kind="code")
def check(response: str) -> bool:
    return len(response.strip()) > 0

# Option 1: Use .bind() method on the evaluator
check.bind(input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[check])

# Option 2: Use bind_evaluator function
bound = bind_evaluator(evaluator=check, input_mapping={"response": "answer"})
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])

Or simply rename columns to match:

df = df.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output",
})

DO NOT use run_evals

# WRONG — legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1])
# Returns List[DataFrame] — one per evaluator

# RIGHT — current 2.0 API
from phoenix.evals import async_evaluate_dataframe
results_df = await async_evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns

Key differences:

run_evals returns a list of DataFrames (one per evaluator)
async_evaluate_dataframe returns a single DataFrame with all results merged
async_evaluate_dataframe uses {name}_score dict column format
async_evaluate_dataframe uses bind_evaluator for input mapping (not input_mapping= param)

Evaluators: Code Evaluators in Python

Deterministic evaluators without LLM. Fast, cheap, reproducible.

Basic Pattern

import re
import json
from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))

@create_evaluator(name="json_valid", kind="code")
def json_valid(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

Parameter Binding

Parameter	Description
`output`	Task output
`input`	Example input
`expected`	Expected output
`metadata`	Example metadata

@create_evaluator(name="matches_expected", kind="code")
def matches_expected(output: str, expected: dict) -> bool:
    return output.strip() == expected.get("answer", "").strip()

Common Patterns

Regex: re.search(pattern, output)
JSON schema: jsonschema.validate()
Keywords: keyword in output.lower()
Length: len(output.split())
Similarity: editdistance.eval() or Jaccard

Return Types

Return type	Result
`bool`	`True` → score=1.0, label="True"; `False` → score=0.0, label="False"
`float`/`int`	Used as the `score` value directly
`str` (short, ≤3 words)	Used as the `label` value
`str` (long, ≥4 words)	Used as the `explanation` value
`dict` with `score`/`label`/`explanation`	Mapped to Score fields directly
`Score` object	Used as-is

Important: Code vs LLM Evaluators

The @create_evaluator decorator wraps a plain Python function.

kind="code" (default): For deterministic evaluators that don't call an LLM.
kind="llm": Marks the evaluator as LLM-based, but you must implement the LLM

call inside the function. The decorator does not call an LLM for you.

For most LLM-based evaluation, prefer ClassificationEvaluator which handles the LLM call, structured output parsing, and explanations automatically:

from phoenix.evals import ClassificationEvaluator, LLM

relevance = ClassificationEvaluator(
    name="relevance",
    prompt_template="Is this relevant?\n{{input}}\n{{output}}\nAnswer:",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

Pre-Built

from phoenix.client.experiments import create_evaluator
from phoenix.evals.metrics import MatchesRegex

date_format = MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}")


@create_evaluator(name="contains_any_keyword", kind="code")
def contains_any_keyword(output, expected):
    keywords = expected.get("keywords", [])
    return any(kw.lower() in str(output).lower() for kw in keywords)


@create_evaluator(name="json_parseable", kind="code")
def json_parseable(output):
    import json

    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

Evaluators: Code Evaluators in TypeScript

Deterministic evaluators without LLM. Fast, cheap, reproducible.

Basic Pattern

import { createEvaluator } from "@arizeai/phoenix-evals";

const containsCitation = createEvaluator<{ output: string }>(
  ({ output }) => /\[\d+\]/.test(output) ? 1 : 0,
  { name: "contains_citation", kind: "CODE" }
);

With Full Results (asExperimentEvaluator)

import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";

const jsonValid = asExperimentEvaluator({
  name: "json_valid",
  kind: "CODE",
  evaluate: async ({ output }) => {
    try {
      JSON.parse(String(output));
      return { score: 1.0, label: "valid_json" };
    } catch (e) {
      return { score: 0.0, label: "invalid_json", explanation: String(e) };
    }
  },
});

Parameter Types

interface EvaluatorParams {
  input: Record<string, unknown>;
  output: unknown;
  expected: Record<string, unknown>;
  metadata: Record<string, unknown>;
}

Common Patterns

Regex: /pattern/.test(output)
JSON: JSON.parse() + zod schema
Keywords: output.includes(keyword)
Similarity: fastest-levenshtein

Evaluators: Custom Templates

Design LLM judge prompts.

Complete Template Pattern

TEMPLATE = """Evaluate faithfulness of the response to the context.

<context>{{context}}</context>
<response>{{output}}</response>

CRITERIA:
"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context

EXAMPLES:
Context: "Price is $10" → Response: "It costs $10" → faithful
Context: "Price is $10" → Response: "About $15" → unfaithful

EDGE CASES:
- Empty context → cannot_evaluate
- "I don't know" when appropriate → faithful
- Partial faithfulness → unfaithful (strict)

Answer (faithful/unfaithful):"""

Template Structure

1. Task description 2. Input variables in XML tags 3. Criteria definitions 4. Examples (2-4 cases) 5. Edge cases 6. Output format

XML Tags

<question>{{input}}</question>
<response>{{output}}</response>
<context>{{context}}</context>
<reference>{{reference}}</reference>

Common Mistakes

Mistake	Fix
Vague criteria	Define each label exactly
No examples	Include 2-4 cases
Ambiguous format	Specify exact output
No edge cases	Address ambiguity

Evaluators: LLM Evaluators in Python

LLM evaluators use a language model to judge outputs. Use when criteria are subjective.

Quick Start

from phoenix.evals import ClassificationEvaluator, LLM

llm = LLM(provider="openai", model="gpt-4o")

HELPFULNESS_TEMPLATE = """Rate how helpful the response is.

<question>{{input}}</question>
<response>{{output}}</response>

"helpful" means directly addresses the question.
"not_helpful" means does not address the question.

Your answer (helpful/not_helpful):"""

helpfulness = ClassificationEvaluator(
    name="helpfulness",
    prompt_template=HELPFULNESS_TEMPLATE,
    llm=llm,
    choices={"not_helpful": 0, "helpful": 1}
)

Template Variables

Use XML tags to wrap variables for clarity:

Variable	XML Tag
`{{input}}`	`<question>{{input}}</question>`
`{{output}}`	`<response>{{output}}</response>`
`{{reference}}`	`<reference>{{reference}}</reference>`
`{{context}}`	`<context>{{context}}</context>`

create_classifier (Factory)

Shorthand factory that returns a ClassificationEvaluator. Prefer direct ClassificationEvaluator instantiation for more parameters/customization:

from phoenix.evals import create_classifier, LLM

relevance = create_classifier(
    name="relevance",
    prompt_template="""Is this response relevant to the question?
<question>{{input}}</question>
<response>{{output}}</response>
Answer (relevant/irrelevant):""",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"relevant": 1.0, "irrelevant": 0.0},
)

Input Mapping

Column names must match template variables. Rename columns or use bind_evaluator:

# Option 1: Rename columns to match template variables
df = df.rename(columns={"user_query": "input", "ai_response": "output"})

# Option 2: Use bind_evaluator
from phoenix.evals import bind_evaluator

bound = bind_evaluator(
    evaluator=helpfulness,
    input_mapping={"input": "user_query", "output": "ai_response"},
)

Running

from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(dataframe=df, evaluators=[helpfulness])

Best Practices

1. Be specific - Define exactly what pass/fail means 2. Include examples - Show concrete cases for each label 3. Explanations by default - ClassificationEvaluator includes explanations automatically 4. Study built-in prompts - See phoenix.evals.__generated__.classification_evaluator_configs for examples of well-structured evaluation prompts (Faithfulness, Correctness, DocumentRelevance, etc.)

Evaluators: LLM Evaluators in TypeScript

LLM evaluators use a language model to judge outputs. Uses Vercel AI SDK.

Quick Start

import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const helpfulness = await createClassificationEvaluator<{
  input: string;
  output: string;
}>({
  name: "helpfulness",
  model: openai("gpt-4o"),
  promptTemplate: `Rate helpfulness.
<question>{{input}}</question>
<response>{{output}}</response>
Answer (helpful/not_helpful):`,
  choices: { not_helpful: 0, helpful: 1 },
});

Template Variables

Use XML tags: <question>{{input}}</question>, <response>{{output}}</response>, <context>{{context}}</context>

Custom Evaluator with asExperimentEvaluator

import { asExperimentEvaluator } from "@arizeai/phoenix-client/experiments";

const customEval = asExperimentEvaluator({
  name: "custom",
  kind: "LLM",
  evaluate: async ({ input, output }) => {
    // Your LLM call here
    return { score: 1.0, label: "pass", explanation: "..." };
  },
});

Pre-Built Evaluators

import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals";

const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model: openai("gpt-4o"),
});

Best Practices

Be specific about criteria
Include examples in prompts
Use <thinking> for chain of thought

Evaluators: Overview

When and how to build automated evaluators.

Decision Framework

Should I Build an Evaluator?
        │
        ▼
Can I fix it with a prompt change?
    YES → Fix the prompt first
    NO  → Is this a recurring issue?
          YES → Build evaluator
          NO  → Add to watchlist

Don't automate prematurely. Many issues are simple prompt fixes.

Evaluator Requirements

1. Clear criteria - Specific, not "Is it good?" 2. Labeled test set - 100+ examples with human labels 3. Measured accuracy - Know TPR/TNR before deploying

Evaluator Lifecycle

1. Discover - Error analysis reveals pattern 2. Design - Define criteria and test cases 3. Implement - Build code or LLM evaluator 4. Calibrate - Validate against human labels 5. Deploy - Add to experiment/CI pipeline 6. Monitor - Track accuracy over time 7. Maintain - Update as product evolves

What NOT to Automate

Rare issues - <5 instances? Watchlist, don't build
Quick fixes - Fixable by prompt change? Fix it
Evolving criteria - Stabilize definition first

Evaluators: Pre-Built

Use for exploration only. Validate before production.

Python

from phoenix.evals import LLM
from phoenix.evals.metrics import FaithfulnessEvaluator

llm = LLM(provider="openai", model="gpt-4o")
faithfulness_eval = FaithfulnessEvaluator(llm=llm)

Note: HallucinationEvaluator is deprecated. Use FaithfulnessEvaluator instead. It uses "faithful"/"unfaithful" labels with score 1.0 = faithful.

TypeScript

import { createHallucinationEvaluator } from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const hallucinationEval = createHallucinationEvaluator({ model: openai("gpt-4o") });

Available (2.0)

Evaluator	Type	Description
`FaithfulnessEvaluator`	LLM	Is the response faithful to the context?
`CorrectnessEvaluator`	LLM	Is the response correct?
`DocumentRelevanceEvaluator`	LLM	Are retrieved documents relevant?
`ToolSelectionEvaluator`	LLM	Did the agent select the right tool?
`ToolInvocationEvaluator`	LLM	Did the agent invoke the tool correctly?
`ToolResponseHandlingEvaluator`	LLM	Did the agent handle the tool response well?
`MatchesRegex`	Code	Does output match a regex pattern?
`PrecisionRecallFScore`	Code	Precision/recall/F-score metrics
`exact_match`	Code	Exact string match

Legacy evaluators (HallucinationEvaluator, QAEvaluator, RelevanceEvaluator, ToxicityEvaluator, SummarizationEvaluator) are in phoenix.evals.legacy and deprecated.

When to Use

Situation	Recommendation
Exploration	Find traces to review
Find outliers	Sort by scores
Production	Validate first (>80% human agreement)
Domain-specific	Build custom

Exploration Pattern

from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])

# Score columns contain dicts — extract numeric scores
scores = results_df["faithfulness_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
low_scores = results_df[scores < 0.5]   # Review these
high_scores = results_df[scores > 0.9]  # Also sample

Validation Required

from sklearn.metrics import classification_report

print(classification_report(human_labels, evaluator_results["label"]))
# Target: >80% agreement

Evaluators: RAG Systems

RAG has two distinct components requiring different evaluation approaches.

Two-Phase Evaluation

RETRIEVAL                    GENERATION
─────────                    ──────────
Query → Retriever → Docs     Docs + Query → LLM → Answer
         │                              │
    IR Metrics              LLM Judges / Code Checks

Debug retrieval first using IR metrics, then tackle generation quality.

Retrieval Evaluation (IR Metrics)

Use traditional information retrieval metrics:

Metric	What It Measures
Recall@k	Of all relevant docs, how many in top k?
Precision@k	Of k retrieved docs, how many relevant?
MRR	How high is first relevant doc?
NDCG	Quality weighted by position

# Requires query-document relevance labels
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    retrieved_set = set(retrieved_ids[:k])
    relevant_set = set(relevant_ids)
    if not relevant_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)

Creating Retrieval Test Data

Generate query-document pairs synthetically:

# Reverse process: document → questions that document answers
def generate_retrieval_test(documents):
    test_pairs = []
    for doc in documents:
        # Extract facts, generate questions
        questions = llm(f"Generate 3 questions this document answers:\n{doc}")
        for q in questions:
            test_pairs.append({"query": q, "relevant_doc_id": doc.id})
    return test_pairs

Generation Evaluation

Use LLM judges for qualities code can't measure:

Eval	Question
Faithfulness	Are all claims supported by retrieved context?
Relevance	Does answer address the question?
Completeness	Does answer cover key points from context?

from phoenix.evals import ClassificationEvaluator, LLM

FAITHFULNESS_TEMPLATE = """Given the context and answer, is every claim in the answer supported by the context?

<context>{{context}}</context>
<answer>{{output}}</answer>

"faithful" = ALL claims supported by context
"unfaithful" = ANY claim NOT in context

Answer (faithful/unfaithful):"""

faithfulness = ClassificationEvaluator(
    name="faithfulness",
    prompt_template=FAITHFULNESS_TEMPLATE,
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"unfaithful": 0, "faithful": 1}
)

RAG Failure Taxonomy

Common failure modes to evaluate:

retrieval_failures:
  - no_relevant_docs: Query returns unrelated content
  - partial_retrieval: Some relevant docs missed
  - wrong_chunk: Right doc, wrong section

generation_failures:
  - hallucination: Claims not in retrieved context
  - ignored_context: Answer doesn't use retrieved docs
  - incomplete: Missing key information from context
  - wrong_synthesis: Misinterprets or miscombines sources

Evaluation Order

1. Retrieval first - If wrong docs, generation will fail 2. Faithfulness - Is answer grounded in context? 3. Answer quality - Does answer address the question?

Fix retrieval problems before debugging generation.

Experiments: Datasets in Python

Creating and managing evaluation datasets.

Creating Datasets

create_dataset() upserts: if a dataset with the same name already exists it is updated in-place; re-running with identical inputs is a no-op.

from phoenix.client import Client

client = Client()

# From examples
dataset = client.datasets.create_dataset(
    name="qa-test-v1",
    examples=[
        {
            "input": {"question": "What is 2+2?"},
            "output": {"answer": "4"},
            "metadata": {"category": "math"},
        },
    ],
)

# With stable example IDs for targeted updates across uploads
dataset = client.datasets.create_dataset(
    name="qa-test-v1",
    examples=[
        {
            "id": "q-001",                      # stable ID — server updates this row, not inserts
            "input": {"question": "What is 2+2?"},
            "output": {"answer": "4"},
            "metadata": {"category": "math"},
        },
    ],
)

# From DataFrame
dataset = client.datasets.create_dataset(
    dataframe=df,
    name="qa-test-v1",
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["category"],
    split_key="split",        # single split column (use this instead of deprecated split_keys)
    example_id_key="id",      # column containing stable example IDs
)

From Production Traces

spans_df = client.spans.get_spans_dataframe(project_identifier="my-app")

dataset = client.datasets.create_dataset(
    dataframe=spans_df[["input.value", "output.value"]],
    name="production-sample-v1",
    input_keys=["input.value"],
    output_keys=["output.value"],
)

Retrieving Datasets

dataset = client.datasets.get_dataset(name="qa-test-v1")
df = dataset.to_dataframe()

Key Parameters

Parameter	Description
`input_keys`	Columns for task input
`output_keys`	Columns for expected output
`metadata_keys`	Additional context
`example_id_key`	Column with stable example IDs; server updates the matching row instead of inserting
`split_key`	Single column for split assignment (replaces deprecated `split_keys`)
`split_keys`	Deprecated — use `split_key` (singular) instead

Using Evaluators in Experiments

Evaluators as experiment evaluators

Pass phoenix-evals evaluators directly to run_experiment as the evaluators argument:

from functools import partial
from phoenix.client import AsyncClient
from phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator

# Define an LLM evaluator
refusal = ClassificationEvaluator(
    name="refusal",
    prompt_template="Is this a refusal?\nQuestion: {{query}}\nResponse: {{response}}",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"refusal": 0, "answer": 1},
)

# Bind to map dataset columns to evaluator params
refusal_evaluator = bind_evaluator(refusal, {"query": "input.query", "response": "output"})

# Define experiment task
async def run_rag_task(input, rag_engine):
    return rag_engine.query(input["query"])

# Run experiment with the evaluator
experiment = await AsyncClient().experiments.run_experiment(
    dataset=ds,
    task=partial(run_rag_task, rag_engine=query_engine),
    experiment_name="baseline",
    evaluators=[refusal_evaluator],
    concurrency=10,
)

Evaluators as the task (meta evaluation)

Use an LLM evaluator as the experiment task to test the evaluator itself against human annotations:

from phoenix.evals import create_evaluator

# The evaluator IS the task being tested
def run_refusal_eval(input, evaluator):
    result = evaluator.evaluate(input)
    return result[0]

# A simple heuristic checks judge vs human agreement
@create_evaluator(name="exact_match")
def exact_match(output, expected):
    return float(output["score"]) == float(expected["refusal_score"])

# Run: evaluator is the task, exact_match evaluates it
experiment = await AsyncClient().experiments.run_experiment(
    dataset=annotated_dataset,
    task=partial(run_refusal_eval, evaluator=refusal),
    experiment_name="judge-v1",
    evaluators=[exact_match],
    concurrency=10,
)

This pattern lets you iterate on evaluator prompts until they align with human judgments. See tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb for a full worked example.

Best Practices

Upsert by default: Re-upload to the same name to update in-place; use example_id_key so the server targets specific rows instead of treating every upload as new data
Versioning: Version with tags or new names (e.g., qa-test-v2) when you want a clean snapshot, not just incremental edits
Metadata: Track source, category, difficulty
Balance: Ensure diverse coverage across categories
Avoid `split_keys`: Pass split_key (singular) — split_keys is deprecated and emits a DeprecationWarning

Experiments: Datasets in TypeScript

Creating and managing evaluation datasets.

Creating Datasets

createDataset() upserts: if a dataset with the same name already exists it is updated to match the provided examples. Re-running with identical inputs is a no-op.

import { createClient } from "@arizeai/phoenix-client";
import { createDataset } from "@arizeai/phoenix-client/datasets";

const client = createClient();

const { datasetId } = await createDataset({
  client,
  name: "qa-test-v1",
  examples: [
    {
      input: { question: "What is 2+2?" },
      output: { answer: "4" },
      metadata: { category: "math" },
    },
  ],
});

// With stable example IDs for targeted updates across uploads
const { datasetId } = await createDataset({
  client,
  name: "qa-test-v1",
  examples: [
    {
      id: "q-001",                        // stable ID — server updates this row, not inserts
      input: { question: "What is 2+2?" },
      output: { answer: "4" },
      metadata: { category: "math" },
    },
  ],
});

Example Structure

interface Example {
  input: Record<string, unknown>;    // Task input
  output?: Record<string, unknown> | null;  // Expected output
  metadata?: Record<string, unknown> | null; // Additional context
  splits?: string | string[] | null; // Split assignment ("train", ["train", "easy"], etc.)
  spanId?: string | null;            // OTEL span ID to link back to source trace
  id?: string | null;                // Stable user-provided ID; server updates matching row
}

From Production Traces

import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans } = await getSpans({
  project: { projectName: "my-app" },
  parentId: null, // root spans only
  limit: 100,
});

const examples = spans.map((span) => ({
  input: { query: span.attributes?.["input.value"] },
  output: { response: span.attributes?.["output.value"] },
  metadata: { spanId: span.context.span_id },
}));

await createDataset({ client, name: "production-sample", examples });

Retrieving Datasets

import { getDataset, listDatasets } from "@arizeai/phoenix-client/datasets";

const dataset = await getDataset({ client, datasetId: "..." });
const all = await listDatasets({ client });

Best Practices

Upsert by default: Re-upload to the same name to update in-place; use id on examples so the server targets specific rows instead of treating every upload as new data
Versioning: Version with new names (e.g., qa-test-v2) when you want a clean snapshot, not just incremental edits
Metadata: Track source, category, provenance
Type safety: Use the Example type from @arizeai/phoenix-client/datasets

Experiments: Overview

Systematic testing of AI systems with datasets, tasks, and evaluators.

Structure

DATASET     → Examples: {input, expected_output, metadata}
TASK        → function(input) → output
EVALUATORS  → (input, output, expected) → score
EXPERIMENT  → Run task on all examples, score results

Basic Usage

from phoenix.client import Client

client = Client()
experiment = client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}

Workflow

1. Create dataset - From traces, synthetic data, or manual curation 2. Define task - The function to test (your LLM pipeline) 3. Select evaluators - Code and/or LLM-based 4. Run experiment - Execute and score 5. Analyze & iterate - Review, modify task, re-run

Dry Runs

Test setup before full execution:

experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    evaluators=evaluators,
    dry_run=3,
)  # Just 3 examples

Async Usage

Use AsyncClient when your task or evaluators make network calls and you want higher throughput:

from phoenix.client import AsyncClient

client = AsyncClient()
experiment = await client.experiments.run_experiment(
    dataset=my_dataset,
    task=my_async_task,
    evaluators=[accuracy, faithfulness],
    experiment_name="improved-retrieval-v2",
)

Best Practices

Name meaningfully: "improved-retrieval-v2-2024-01-15" not "test"
Version datasets: Don't modify existing
Multiple evaluators: Combine perspectives

Experiments: Running Experiments in Python

Execute experiments with run_experiment.

Basic Usage

from phoenix.client import Client
from phoenix.client.experiments import run_experiment

client = Client()
dataset = client.datasets.get_dataset(name="qa-test-v1")

def my_task(example):
    return call_llm(example.input["question"])

def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[exact_match],
    experiment_name="qa-experiment-v1",
)

Task Functions

# Basic task
def task(example):
    return call_llm(example.input["question"])

# With context (RAG)
def rag_task(example):
    return call_llm(f"Context: {example.input['context']}\nQ: {example.input['question']}")

Evaluator Parameters

Parameter	Access
`output`	Task output
`expected`	Example expected output
`input`	Example input
`metadata`	Example metadata

Options

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=evaluators,
    experiment_name="my-experiment",
    dry_run=3,       # Test with 3 examples
    repetitions=3,   # Run each example 3 times
)

Results

print(experiment.aggregate_scores)
# {'accuracy': 0.85, 'faithfulness': 0.92}

for run in experiment.runs:
    print(run.output, run.scores)

Stability

Single-run scores are noisy when either the task or the evaluator is non-deterministic — an LLM call, tool use, streaming output, an LLM-as-judge. On a small dataset, that per-run noise can swamp the signal from a prompt change.

Averaging over repetitions lets the score you report reflect the prompt rather than the sampling noise:

run_experiment(
    # ...
    repetitions=3,
)

Things to consider:

Reach for repetitions when the task or the evaluator is an LLM call and the dataset is small.
Prefer repetitions when per-example cost is low and you mostly want to settle the score; prefer growing the dataset when you also need to cover more behaviors.
Skip repetitions when both the task and the evaluator are deterministic (e.g. string comparison against a ground truth) — a single run is the answer.

Consider adding stability when:

Repeat runs of the same experiment drift in ways that feel larger than the differences you're trying to measure.
A prompt change flips example labels in ways that don't track with how the outputs actually changed.
The judge's reasoning on the same output reads differently from one run to the next.

Repetitions are also what repetitions=1 (default) silently relies on — don't trust a tuning decision based on a single 10-example run.

Add Evaluations Later

from phoenix.client.experiments import evaluate_experiment

evaluate_experiment(experiment=experiment, evaluators=[new_evaluator])

Experiments: Running Experiments in TypeScript

Execute experiments with runExperiment.

Basic Usage

import { createClient } from "@arizeai/phoenix-client";
import {
  runExperiment,
  asExperimentEvaluator,
} from "@arizeai/phoenix-client/experiments";

const client = createClient();

const task = async (example: { input: Record<string, unknown> }) => {
  return await callLLM(example.input.question as string);
};

const exactMatch = asExperimentEvaluator({
  name: "exact_match",
  kind: "CODE",
  evaluate: async ({ output, expected }) => ({
    score: output === expected?.answer ? 1.0 : 0.0,
    label: output === expected?.answer ? "match" : "no_match",
  }),
});

const experiment = await runExperiment({
  client,
  experimentName: "qa-experiment-v1",
  dataset: { datasetId: "your-dataset-id" },
  task,
  evaluators: [exactMatch],
});

Task Functions

// Basic task
const task = async (example) => await callLLM(example.input.question as string);

// With context (RAG)
const ragTask = async (example) => {
  const prompt = `Context: ${example.input.context}\nQ: ${example.input.question}`;
  return await callLLM(prompt);
};

Evaluator Parameters

interface EvaluatorParams {
  input: Record<string, unknown>;
  output: unknown;
  expected: Record<string, unknown>;
  metadata: Record<string, unknown>;
}

Options

const experiment = await runExperiment({
  client,
  experimentName: "my-experiment",
  dataset: { datasetName: "qa-test-v1" },
  task,
  evaluators,
  repetitions: 3, // Run each example 3 times
  maxConcurrency: 5, // Limit concurrent executions
});

Stability

Averaging over repetitions lets the score you report reflect the prompt rather than the sampling noise:

await runExperiment({
  // ...
  repetitions: 3,
});

Things to consider:

Reach for repetitions when the task or the evaluator is an LLM call and the dataset is small.
Prefer repetitions when per-example cost is low and you mostly want to settle the score; prefer growing the dataset when you also need to cover more behaviors.
Skip repetitions when both the task and the evaluator are deterministic (e.g. string comparison against a ground truth) — a single run is the answer.

Consider adding stability when:

Repeat runs of the same experiment drift in ways that feel larger than the differences you're trying to measure.
A prompt change flips example labels in ways that don't track with how the outputs actually changed.
The judge's reasoning on the same output reads differently from one run to the next.

Repetitions are also what repetitions: 1 (default) silently relies on — don't trust a tuning decision based on a single 10-example run.

Add Evaluations Later

import { evaluateExperiment } from "@arizeai/phoenix-client/experiments";

await evaluateExperiment({ client, experiment, evaluators: [newEvaluator] });

Experiments: Generating Synthetic Test Data

Creating diverse, targeted test data for evaluation.

Dimension-Based Approach

Define axes of variation, then generate combinations:

dimensions = {
    "issue_type": ["billing", "technical", "shipping"],
    "customer_mood": ["frustrated", "neutral", "happy"],
    "complexity": ["simple", "moderate", "complex"],
}

Two-Step Generation

1. Generate tuples (combinations of dimension values) 2. Convert to natural queries (separate LLM call per tuple)

# Step 1: Create tuples
tuples = [
    ("billing", "frustrated", "complex"),
    ("shipping", "neutral", "simple"),
]

# Step 2: Convert to natural query
def tuple_to_query(t):
    prompt = f"""Generate a realistic customer message:
    Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}
    
    Write naturally, include typos if appropriate. Don't be formulaic."""
    return llm(prompt)

Target Failure Modes

Dimensions should target known failures from error analysis:

# From error analysis findings
dimensions = {
    "timezone": ["EST", "PST", "UTC", "ambiguous"],  # Known failure
    "date_format": ["ISO", "US", "EU", "relative"],   # Known failure
}

Quality Control

Validate: Check for placeholder text, minimum length
Deduplicate: Remove near-duplicate queries using embeddings
Balance: Ensure coverage across dimension values

When to Use

Use Synthetic	Use Real Data
Limited production data	Sufficient traces
Testing edge cases	Validating actual behavior
Pre-launch evals	Post-launch monitoring

Sample Sizes

Purpose	Size
Initial exploration	50-100
Comprehensive eval	100-500
Per-dimension	10-20 per combination

Experiments: Generating Synthetic Test Data (TypeScript)

Creating diverse, targeted test data for evaluation.

Dimension-Based Approach

Define axes of variation, then generate combinations:

const dimensions = {
  issueType: ["billing", "technical", "shipping"],
  customerMood: ["frustrated", "neutral", "happy"],
  complexity: ["simple", "moderate", "complex"],
};

Two-Step Generation

1. Generate tuples (combinations of dimension values) 2. Convert to natural queries (separate LLM call per tuple)

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Step 1: Create tuples
type Tuple = [string, string, string];
const tuples: Tuple[] = [
  ["billing", "frustrated", "complex"],
  ["shipping", "neutral", "simple"],
];

// Step 2: Convert to natural query
async function tupleToQuery(t: Tuple): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    prompt: `Generate a realistic customer message:
    Issue: ${t[0]}, Mood: ${t[1]}, Complexity: ${t[2]}
    
    Write naturally, include typos if appropriate. Don't be formulaic.`,
  });
  return text;
}

Target Failure Modes

Dimensions should target known failures from error analysis:

// From error analysis findings
const dimensions = {
  timezone: ["EST", "PST", "UTC", "ambiguous"], // Known failure
  dateFormat: ["ISO", "US", "EU", "relative"], // Known failure
};

Quality Control

Validate: Check for placeholder text, minimum length
Deduplicate: Remove near-duplicate queries using embeddings
Balance: Ensure coverage across dimension values

function validateQuery(query: string): boolean {
  const minLength = 20;
  const hasPlaceholder = /\[.*?\]|<.*?>/.test(query);
  return query.length >= minLength && !hasPlaceholder;
}

When to Use

Use Synthetic	Use Real Data
Limited production data	Sufficient traces
Testing edge cases	Validating actual behavior
Pre-launch evals	Post-launch monitoring

Sample Sizes

Purpose	Size
Initial exploration	50-100
Comprehensive eval	100-500
Per-dimension	10-20 per combination

Anti-Patterns

Common mistakes and fixes.

Anti-Pattern	Problem	Fix
Generic metrics	Pre-built scores don't match your failures	Build from error analysis
Vibe-based	No quantification	Measure with experiments
Ignoring humans	Uncalibrated LLM judges	Validate >80% TPR/TNR
Premature automation	Evaluators for imagined problems	Let observed failures drive
Saturation blindness	100% pass = no signal	Keep capability evals at 50-80%
Similarity metrics	BERTScore/ROUGE for generation	Use for retrieval only
Model switching	Hoping a model works better	Error analysis first
Single-run scoring	LLM judges and non-deterministic tasks add per-run noise that can drown the signal from a prompt change on a small dataset	Set `repetitions` on `runExperiment` (or grow the dataset) when the task or judge is an LLM call

Quantify Changes

from phoenix.client import Client

client = Client()
baseline = client.experiments.run_experiment(dataset=dataset, task=old_prompt, evaluators=evaluators)
improved = client.experiments.run_experiment(dataset=dataset, task=new_prompt, evaluators=evaluators)
print(f"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}")

Don't Use Similarity for Generation

# BAD
score = bertscore(output, reference)

# GOOD
correct_facts = check_facts_against_source(output, context)

Error Analysis Before Model Change

# BAD
for model in models:
    results = test(model)

# GOOD
failures = analyze_errors(results)
# Then decide if model change is warranted

Model Selection

Error analysis first, model changes last.

Decision Tree

Performance Issue?
       │
       ▼
Error analysis suggests model problem?
    NO  → Fix prompts, retrieval, tools
    YES → Is it a capability gap?
          YES → Consider model change
          NO  → Fix the actual problem

Judge Model Selection

Principle	Action
Start capable	Use gpt-4o first
Optimize later	Test cheaper after criteria stable
Same model OK	Judge does different task

# Start with capable model
judge = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o"),
    ...
)

# After validation, test cheaper
judge_cheap = ClassificationEvaluator(
    llm=LLM(provider="openai", model="gpt-4o-mini"),
    ...
)
# Compare TPR/TNR on same test set

Don't Model Shop

from phoenix.client import Client

client = Client()

# BAD
for model in ["gpt-4o", "claude-3", "gemini-pro"]:
    results = client.experiments.run_experiment(
        dataset=dataset,
        task=lambda input, _model=model: task(input, model=_model),
        evaluators=evaluators,
    )

# GOOD
failures = analyze_errors(results)
# "Ignores context" → Fix prompt
# "Can't do math" → Maybe try better model

When Model Change Is Warranted

Failures persist after prompt optimization
Capability gaps (reasoning, math, code)
Error analysis confirms model limitation

Fundamentals

Application-specific tests for AI systems. Code first, LLM for nuance, human for truth.

Evaluator Types

Type	Speed	Cost	Use Case
Code	Fast	Cheap	Regex, JSON, format, exact match
LLM	Medium	Medium	Subjective quality, complex criteria
Human	Slow	Expensive	Ground truth, calibration

Decision: Code first → LLM only when code can't capture criteria → Human for calibration.

Score Structure

Property	Required	Description
`name`	Yes	Evaluator name
`kind`	Yes	`"code"`, `"llm"`, `"human"`
`score`	No*	0-1 numeric
`label`	No*	`"pass"`, `"fail"`
`explanation`	No	Rationale

*One of score or label required.

Binary > Likert

Use pass/fail, not 1-5 scales. Clearer criteria, easier calibration.

# Multiple binary checks instead of one Likert scale
evaluators = [
    AnswersQuestion(),    # Yes/No
    UsesContext(),        # Yes/No
    NoHallucination(),    # Yes/No
]

Quick Patterns

Code Evaluator

from phoenix.evals import create_evaluator

@create_evaluator(name="has_citation", kind="code")
def has_citation(output: str) -> bool:
    return bool(re.search(r'\[\d+\]', output))

LLM Evaluator

from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="helpfulness",
    prompt_template="...",
    llm=LLM(provider="openai", model="gpt-4o"),
    choices={"not_helpful": 0, "helpful": 1}
)

Run Experiment

from phoenix.client.experiments import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[evaluator1, evaluator2],
)
print(experiment.aggregate_scores)

Observe: Sampling Strategies

How to efficiently sample production traces for review.

Strategies

1. Failure-Focused (Highest Priority)

errors = spans_df[spans_df["status_code"] == "ERROR"]
negative_feedback = spans_df[spans_df["feedback"] == "negative"]

2. Outliers

long_responses = spans_df.nlargest(50, "response_length")
slow_responses = spans_df.nlargest(50, "latency_ms")

3. Stratified (Coverage)

# Sample equally from each category
by_query_type = spans_df.groupby("metadata.query_type").apply(
    lambda x: x.sample(min(len(x), 20))
)

4. Metric-Guided

# Review traces flagged by automated evaluators
flagged = spans_df[eval_results["label"] == "hallucinated"]
borderline = spans_df[(eval_results["score"] > 0.3) & (eval_results["score"] < 0.7)]

Building a Review Queue

def build_review_queue(spans_df, max_traces=100):
    queue = pd.concat([
        spans_df[spans_df["status_code"] == "ERROR"],
        spans_df[spans_df["feedback"] == "negative"],
        spans_df.nlargest(10, "response_length"),
        spans_df.sample(min(30, len(spans_df))),
    ]).drop_duplicates("span_id").head(max_traces)
    return queue

Sample Size Guidelines

Purpose	Size
Initial exploration	50-100
Error analysis	100+ (until saturation)
Golden dataset	100-500
Judge calibration	100+ per class

Saturation: Stop when new traces show the same failure patterns.

Trace-Level Sampling

When you need whole requests (all spans per trace), use get_traces:

from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# Recent traces with full span trees
traces = client.traces.get_traces(
    project_identifier="my-app",
    limit=100,
    include_spans=True,
)

# Time-windowed sampling (e.g., last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    limit=50,
    include_spans=True,
)

# Filter by session (multi-turn conversations)
traces = client.traces.get_traces(
    project_identifier="my-app",
    session_id="user-session-abc",
    include_spans=True,
)

# Sort by latency to find slowest requests
traces = client.traces.get_traces(
    project_identifier="my-app",
    sort="latency_ms",
    order="desc",
    limit=50,
)

Observe: Sampling Strategies (TypeScript)

How to efficiently sample production traces for review.

Strategies

1. Failure-Focused (Highest Priority)

Use server-side filters to fetch only what you need:

import { getSpans } from "@arizeai/phoenix-client/spans";

// Server-side filter — only ERROR spans are returned
const { spans: errors } = await getSpans({
  project: { projectName: "my-project" },
  statusCode: "ERROR",
  limit: 100,
});

// Fetch only LLM spans
const { spans: llmSpans } = await getSpans({
  project: { projectName: "my-project" },
  spanKind: "LLM",
  limit: 100,
});

// Filter by span name
const { spans: chatSpans } = await getSpans({
  project: { projectName: "my-project" },
  name: "chat_completion",
  limit: 100,
});

2. Outliers

const { spans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 200,
});
const latency = (s: (typeof spans)[number]) =>
  new Date(s.end_time).getTime() - new Date(s.start_time).getTime();
const sorted = [...spans].sort((a, b) => latency(b) - latency(a));
const slowResponses = sorted.slice(0, 50);

3. Stratified (Coverage)

// Sample equally from each category
function stratifiedSample<T>(items: T[], groupBy: (item: T) => string, perGroup: number): T[] {
  const groups = new Map<string, T[]>();
  for (const item of items) {
    const key = groupBy(item);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(item);
  }
  return [...groups.values()].flatMap((g) => g.slice(0, perGroup));
}

const { spans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 500,
});
const byQueryType = stratifiedSample(spans, (s) => s.attributes?.["metadata.query_type"] ?? "unknown", 20);

4. Metric-Guided

import { getSpanAnnotations } from "@arizeai/phoenix-client/spans";

// Fetch annotations for your spans, then filter by label
const { annotations } = await getSpanAnnotations({
  project: { projectName: "my-project" },
  spanIds: spans.map((s) => s.context.span_id),
  includeAnnotationNames: ["hallucination"],
});

const flaggedSpanIds = new Set(
  annotations.filter((a) => a.result?.label === "hallucinated").map((a) => a.span_id)
);
const flagged = spans.filter((s) => flaggedSpanIds.has(s.context.span_id));

Trace-Level Sampling

When you need whole requests (all spans in a trace), use getTraces:

import { getTraces } from "@arizeai/phoenix-client/traces";

// Recent traces with full span trees
const { traces } = await getTraces({
  project: { projectName: "my-project" },
  limit: 100,
  includeSpans: true,
});

// Filter by session (e.g., multi-turn conversations)
const { traces: sessionTraces } = await getTraces({
  project: { projectName: "my-project" },
  sessionId: "user-session-abc",
  includeSpans: true,
});

// Time-windowed sampling
const { traces: recentTraces } = await getTraces({
  project: { projectName: "my-project" },
  startTime: new Date(Date.now() - 60 * 60 * 1000), // last hour
  limit: 50,
  includeSpans: true,
});

Building a Review Queue

// Combine server-side filters into a review queue
const { spans: errorSpans } = await getSpans({
  project: { projectName: "my-project" },
  statusCode: "ERROR",
  limit: 30,
});
const { spans: allSpans } = await getSpans({
  project: { projectName: "my-project" },
  limit: 100,
});
const random = allSpans.sort(() => Math.random() - 0.5).slice(0, 30);

const combined = [...errorSpans, ...random];
const unique = [...new Map(combined.map((s) => [s.context.span_id, s])).values()];
const reviewQueue = unique.slice(0, 100);

Sample Size Guidelines

Purpose	Size
Initial exploration	50-100
Error analysis	100+ (until saturation)
Golden dataset	100-500
Judge calibration	100+ per class

Saturation: Stop when new traces show the same failure patterns.

Observe: Tracing Setup

Configure tracing to capture data for evaluation.

Quick Setup

# Python
from phoenix.otel import register

register(project_name="my-app", auto_instrument=True)

// TypeScript
import { registerPhoenix } from "@arizeai/phoenix-otel";

registerPhoenix({ projectName: "my-app", autoInstrument: true });

Essential Attributes

Attribute	Why It Matters
`input.value`	User's request
`output.value`	Response to evaluate
`retrieval.documents`	Context for faithfulness
`tool.name`, `tool.parameters`	Agent evaluation
`llm.model_name`	Track by model

Custom Attributes for Evals

span.set_attribute("metadata.client_type", "enterprise")
span.set_attribute("metadata.query_category", "billing")

Exporting for Evaluation

Spans (Python — DataFrame)

from phoenix.client import Client

# Client() works for local Phoenix (falls back to env vars or localhost:6006)
# For remote/cloud: Client(base_url="https://app.phoenix.arize.com", api_key="...")
client = Client()
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-app",  # NOT project_name= (deprecated)
    root_spans_only=True,
)

dataset = client.datasets.create_dataset(
    name="error-analysis-set",
    dataframe=spans_df[["input.value", "output.value"]],
    input_keys=["input.value"],
    output_keys=["output.value"],
)

Spans (TypeScript)

import { getSpans } from "@arizeai/phoenix-client/spans";

const { spans } = await getSpans({
  project: { projectName: "my-app" },
  parentId: null, // root spans only
  limit: 100,
});

Traces (Python — structured)

Use get_traces when you need full trace trees (e.g., multi-turn conversations, agent workflows):

from datetime import datetime, timedelta

traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=24),
    include_spans=True,  # includes all spans per trace
    limit=100,
)
# Each trace has: trace_id, start_time, end_time, spans (when include_spans=True)

Traces (TypeScript)

import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

Uploading Evaluations as Annotations

Python

from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe

# Run evaluations
results_df = evaluate_dataframe(dataframe=spans_df, evaluators=[my_eval])

# Format results for Phoenix annotations
annotations_df = to_annotation_dataframe(results_df)

# Upload to Phoenix
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)

TypeScript

import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

await logSpanAnnotations({
  spanAnnotations: [
    {
      spanId: "abc123",
      name: "quality",
      label: "good",
      score: 0.95,
      annotatorKind: "LLM",
    },
  ],
});

Annotations are visible in the Phoenix UI alongside your traces.

Verify

Required attributes: input.value, output.value, status_code For RAG: retrieval.documents For agents: tool.name, tool.parameters

Production: Continuous Evaluation

Capability vs regression evals and the ongoing feedback loop.

Two Types of Evals

Type	Pass Rate Target	Purpose	Update
Capability	50-80%	Measure improvement	Add harder cases
Regression	95-100%	Catch breakage	Add fixed bugs

Saturation

When capability evals hit >95% pass rate, they're saturated: 1. Graduate passing cases to regression suite 2. Add new challenging cases to capability suite

Feedback Loop

Production → Sample traffic → Run evaluators → Find failures
    ↑                                              ↓
Deploy  ←  Run CI evals  ←  Create test cases  ←  Error analysis

Implementation

Build a continuous monitoring loop:

1. Sample recent traces at regular intervals (e.g., 100 traces per hour) 2. Run evaluators on sampled traces 3. Log results to Phoenix for tracking 4. Queue concerning results for human review 5. Create test cases from recurring failure patterns

Python

from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# 1. Sample recent spans (includes full attributes for evaluation)
spans_df = client.spans.get_spans_dataframe(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    root_spans_only=True,
    limit=100,
)

# 2. Run evaluators
from phoenix.evals import evaluate_dataframe

results_df = evaluate_dataframe(
    dataframe=spans_df,
    evaluators=[quality_eval, safety_eval],
)

# 3. Upload results as annotations
from phoenix.evals.utils import to_annotation_dataframe

annotations_df = to_annotation_dataframe(results_df)
client.spans.log_span_annotations_dataframe(dataframe=annotations_df)

TypeScript

import { getSpans } from "@arizeai/phoenix-client/spans";
import { logSpanAnnotations } from "@arizeai/phoenix-client/spans";

// 1. Sample recent spans
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  parentId: null, // root spans only
  limit: 100,
});

// 2. Run evaluators (user-defined)
const results = await Promise.all(
  spans.map(async (span) => ({
    spanId: span.context.span_id,
    ...await runEvaluators(span, [qualityEval, safetyEval]),
  }))
);

// 3. Upload results as annotations
await logSpanAnnotations({
  spanAnnotations: results.map((r) => ({
    spanId: r.spanId,
    name: "quality",
    score: r.qualityScore,
    label: r.qualityLabel,
    annotatorKind: "LLM" as const,
  })),
});

For trace-level monitoring (e.g., agent workflows), use get_traces/getTraces to identify traces:

# Python: identify slow traces
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    sort="latency_ms",
    order="desc",
    limit=50,
)

// TypeScript: identify slow traces
import { getTraces } from "@arizeai/phoenix-client/traces";

const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 50,
});

Alerting

Condition	Severity	Action
Regression < 98%	Critical	Page oncall
Capability declining	Warning	Slack notify
Capability > 95% for 7d	Info	Schedule review

Key Principles

Two suites - Capability + Regression always
Graduate cases - Move consistent passes to regression
Track trends - Monitor over time, not just snapshots

Production: Guardrails vs Evaluators

Guardrails block in real-time. Evaluators measure asynchronously.

Key Distinction

Request → [INPUT GUARDRAIL] → LLM → [OUTPUT GUARDRAIL] → Response
                                            │
                                            └──→ ASYNC EVALUATOR (background)

Guardrails

Aspect	Requirement
Timing	Synchronous, blocking
Latency	< 100ms
Purpose	Prevent harm
Type	Code-based (deterministic)

Use for: PII detection, prompt injection, profanity, length limits, format validation.

Evaluators

Aspect	Characteristic
Timing	Async, background
Latency	Can be seconds
Purpose	Measure quality
Type	Can use LLMs

Use for: Helpfulness, faithfulness, tone, completeness, citation accuracy.

Decision

Question	Answer
Must block harmful content?	Guardrail
Measuring quality?	Evaluator
Need LLM judgment?	Evaluator
< 100ms required?	Guardrail
False positives = angry users?	Evaluator

LLM Guardrails: Rarely

Only use LLM guardrails if:

Latency budget > 1s
Error cost >> LLM cost
Low volume
Fallback exists

Key Principle: Guardrails prevent harm (block). Evaluators measure quality (log).

Production: Overview

CI/CD evals vs production monitoring - complementary approaches.

Two Evaluation Modes

Aspect	CI/CD Evals	Production Monitoring
When	Pre-deployment	Post-deployment, ongoing
Data	Fixed dataset	Sampled traffic
Goal	Prevent regression	Detect drift
Response	Block deploy	Alert & analyze

CI/CD Evaluations

from phoenix.client import Client

client = Client()

# Fast, deterministic checks
ci_evaluators = [
    has_required_format,
    no_pii_leak,
    safety_check,
    regression_test_suite,
]

# Small but representative dataset (~100 examples)
client.experiments.run_experiment(dataset=ci_dataset, task=task, evaluators=ci_evaluators)

Set thresholds: regression=0.95, safety=1.0, format=0.98.

Production Monitoring

Python

from phoenix.client import Client
from datetime import datetime, timedelta

client = Client()

# Sample recent traces (last hour)
traces = client.traces.get_traces(
    project_identifier="my-app",
    start_time=datetime.now() - timedelta(hours=1),
    include_spans=True,
    limit=100,
)

# Run evaluators on sampled traffic
for trace in traces:
    results = run_evaluators_async(trace, production_evaluators)
    if any(r["score"] < 0.5 for r in results):
        alert_on_failure(trace, results)

TypeScript

import { getTraces } from "@arizeai/phoenix-client/traces";
import { getSpans } from "@arizeai/phoenix-client/spans";

// Sample recent traces (last hour)
const { traces } = await getTraces({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  includeSpans: true,
  limit: 100,
});

// Or sample spans directly for evaluation
const { spans } = await getSpans({
  project: { projectName: "my-app" },
  startTime: new Date(Date.now() - 60 * 60 * 1000),
  limit: 100,
});

// Run evaluators on sampled traffic
for (const span of spans) {
  const results = await runEvaluators(span, productionEvaluators);
  if (results.some((r) => r.score < 0.5)) {
    await alertOnFailure(span, results);
  }
}

Prioritize: errors → negative feedback → random sample.

Feedback Loop

Production finds failure → Error analysis → Add to CI dataset → Prevents future regression

Setup: Python

Packages required for Phoenix evals and experiments.

Installation

# Core Phoenix package (includes client, evals, otel)
pip install arize-phoenix

# Or install individual packages
pip install arize-phoenix-client   # Phoenix client only
pip install arize-phoenix-evals    # Evaluation utilities
pip install arize-phoenix-otel     # OpenTelemetry integration

LLM Providers

For LLM-as-judge evaluators, install your provider's SDK:

pip install openai      # OpenAI
pip install anthropic   # Anthropic
pip install google-generativeai  # Google

Validation (Optional)

pip install scikit-learn  # For TPR/TNR metrics

Quick Verify

from phoenix.client import Client
from phoenix.evals import LLM, ClassificationEvaluator
from phoenix.otel import register

# All imports should work
print("Phoenix Python setup complete")

Key Imports (Evals 2.0)

from phoenix.client import Client
from phoenix.evals import (
    ClassificationEvaluator,      # LLM classification evaluator (preferred)
    LLM,                          # Provider-agnostic LLM wrapper
    async_evaluate_dataframe,     # Batch evaluate a DataFrame (preferred, async)
    evaluate_dataframe,           # Batch evaluate a DataFrame (sync)
    create_evaluator,             # Decorator for code-based evaluators
    create_classifier,            # Factory for LLM classification evaluators
    bind_evaluator,               # Map column names to evaluator params
    Score,                        # Score dataclass
)
from phoenix.evals.utils import to_annotation_dataframe  # Format results for Phoenix annotations

Prefer: ClassificationEvaluator over create_classifier (more parameters/customization). Prefer: async_evaluate_dataframe over evaluate_dataframe (better throughput for LLM evals).

Do NOT use legacy 1.0 imports: OpenAIModel, AnthropicModel, run_evals, llm_classify.

Setup: TypeScript

Packages required for Phoenix evals and experiments.

Installation

# Using npm
npm install @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel

# Using pnpm
pnpm add @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel

LLM Providers

For LLM-as-judge evaluators, install Vercel AI SDK providers:

npm install ai @ai-sdk/openai      # Vercel AI SDK + OpenAI
npm install @ai-sdk/anthropic      # Anthropic
npm install @ai-sdk/google         # Google

Or use direct provider SDKs:

npm install openai                 # OpenAI direct
npm install @anthropic-ai/sdk      # Anthropic direct

Quick Verify

import { createClient } from "@arizeai/phoenix-client";
import { createClassificationEvaluator } from "@arizeai/phoenix-evals";
import { registerPhoenix } from "@arizeai/phoenix-otel";

// All imports should work
console.log("Phoenix TypeScript setup complete");

Validating Evaluators (Python)

Validate LLM evaluators against human-labeled examples. Target >80% TPR/TNR/Accuracy.

Calculate Metrics

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(human_labels, evaluator_predictions))

cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
print(f"TPR: {tpr:.2f}, TNR: {tnr:.2f}")

Correct Production Estimates

def correct_estimate(observed, tpr, tnr):
    """Adjust observed pass rate using known TPR/TNR."""
    return (observed - (1 - tnr)) / (tpr - (1 - tnr))

Find Misclassified

# False Positives: Evaluator pass, human fail
fp_mask = (evaluator_predictions == 1) & (human_labels == 0)
false_positives = dataset[fp_mask]

# False Negatives: Evaluator fail, human pass
fn_mask = (evaluator_predictions == 0) & (human_labels == 1)
false_negatives = dataset[fn_mask]

Red Flags

TPR or TNR < 70%
Large gap between TPR and TNR
Kappa < 0.6

Validating Evaluators (TypeScript)

Validate an LLM evaluator against human-labeled examples before deploying it. Target: >80% TPR and >80% TNR.

Roles are inverted compared to a normal task experiment:

Normal experiment	Evaluator validation
Task = agent logic	Task = run the evaluator under test
Evaluator = judge output	Evaluator = exact-match vs human ground truth
Dataset = agent examples	Dataset = golden hand-labeled examples

Golden Dataset

Use a separate dataset name so validation experiments don't mix with task experiments in Phoenix. Store human ground truth in metadata.groundTruthLabel. Aim for ~50/50 balance:

import type { Example } from "@arizeai/phoenix-client/types/datasets";

const goldenExamples: Example[] = [
  { input: { q: "Capital of France?" }, output: { answer: "Paris" },       metadata: { groundTruthLabel: "correct" } },
  { input: { q: "Capital of France?" }, output: { answer: "Lyon" },        metadata: { groundTruthLabel: "incorrect" } },
  { input: { q: "Capital of France?" }, output: { answer: "Major city..." }, metadata: { groundTruthLabel: "incorrect" } },
];

const VALIDATOR_DATASET = "my-app-qa-evaluator-validation"; // separate from task dataset
const POSITIVE_LABEL = "correct";
const NEGATIVE_LABEL = "incorrect";

Validation Experiment

import { createClient } from "@arizeai/phoenix-client";
import { createOrGetDataset, getDatasetExamples } from "@arizeai/phoenix-client/datasets";
import { asExperimentEvaluator, runExperiment } from "@arizeai/phoenix-client/experiments";
import { myEvaluator } from "./myEvaluator.js";

const client = createClient();

const { datasetId } = await createOrGetDataset({ client, name: VALIDATOR_DATASET, examples: goldenExamples });
const { examples } = await getDatasetExamples({ client, dataset: { datasetId } });
const groundTruth = new Map(examples.map((ex) => [ex.id, ex.metadata?.groundTruthLabel as string]));

// Task: invoke the evaluator under test
const task = async (example: (typeof examples)[number]) => {
  const result = await myEvaluator.evaluate({ input: example.input, output: example.output, metadata: example.metadata });
  return result.label ?? "unknown";
};

// Evaluator: exact-match against human ground truth
const exactMatch = asExperimentEvaluator({
  name: "exact-match", kind: "CODE",
  evaluate: ({ output, metadata }) => {
    const expected = metadata?.groundTruthLabel as string;
    const predicted = typeof output === "string" ? output : "unknown";
    return { score: predicted === expected ? 1 : 0, label: predicted, explanation: `Expected: ${expected}, Got: ${predicted}` };
  },
});

const experiment = await runExperiment({
  client, experimentName: `evaluator-validation-${Date.now()}`,
  dataset: { datasetId }, task, evaluators: [exactMatch],
});

// Compute confusion matrix
const runs = Object.values(experiment.runs);
const predicted = new Map((experiment.evaluationRuns ?? [])
  .filter((e) => e.name === "exact-match")
  .map((e) => [e.experimentRunId, e.result?.label ?? null]));

let tp = 0, fp = 0, tn = 0, fn = 0;
for (const run of runs) {
  if (run.error) continue;
  const p = predicted.get(run.id), a = groundTruth.get(run.datasetExampleId);
  if (!p || !a) continue;
  if (a === POSITIVE_LABEL && p === POSITIVE_LABEL) tp++;
  else if (a === NEGATIVE_LABEL && p === POSITIVE_LABEL) fp++;
  else if (a === NEGATIVE_LABEL && p === NEGATIVE_LABEL) tn++;
  else if (a === POSITIVE_LABEL && p === NEGATIVE_LABEL) fn++;
}
const total = tp + fp + tn + fn;
const tpr = tp + fn > 0 ? (tp / (tp + fn)) * 100 : 0;
const tnr = tn + fp > 0 ? (tn / (tn + fp)) * 100 : 0;
console.log(`TPR: ${tpr.toFixed(1)}%  TNR: ${tnr.toFixed(1)}%  Accuracy: ${((tp + tn) / total * 100).toFixed(1)}%`);

Results & Quality Rules

Metric	Target	Low value means
TPR (sensitivity)	>80%	Misses real failures (false negatives)
TNR (specificity)	>80%	Flags good outputs (false positives)
Accuracy	>80%	General weakness

Golden dataset rules: ~50/50 balance · include edge cases · human-labeled only · never mutate (append new versions) · 20–50 examples is enough.

Re-validate when: prompt template changes · judge model changes · criteria updated · production FP/FN spike.

Validation

Validate LLM judges against human labels before deploying. Target >80% agreement.

Requirements

Requirement	Target
Test set size	100+ examples
Balance	~50/50 pass/fail
Accuracy	>80%
TPR/TNR	Both >70%

Metrics

Metric	Formula	Use When
Accuracy	(TP+TN) / Total	General
TPR (Recall)	TP / (TP+FN)	Quality assurance
TNR (Specificity)	TN / (TN+FP)	Safety-critical
Cohen's Kappa	Agreement beyond chance	Comparing evaluators

Quick Validation

from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score

print(classification_report(human_labels, evaluator_predictions))
print(f"Kappa: {cohen_kappa_score(human_labels, evaluator_predictions):.3f}")

# Get TPR/TNR
cm = confusion_matrix(human_labels, evaluator_predictions)
tn, fp, fn, tp = cm.ravel()
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)

Golden Dataset Structure

golden_example = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital.",
    "ground_truth_label": "correct",
}

Building Golden Datasets

1. Sample production traces (errors, negative feedback, edge cases) 2. Balance ~50/50 pass/fail 3. Expert labels each example 4. Version datasets (never modify existing)

# GOOD - create new version
golden_v2 = golden_v1 + [new_examples]

# BAD - never modify existing
golden_v1.append(new_example)

Warning Signs

All pass or all fail → too lenient/strict
Random results → criteria unclear
TPR/TNR < 70% → needs improvement

Re-Validate When

Prompt template changes
Judge model changes
Criteria changes
Monthly

Related skills

Setup Matt Pocock SkillsScaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la462k185k

Lark Skill MakerQuickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md.379k15.8k

CavemanSlash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents.378k92.5k

Lark AppsConnect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access.375k

Running Claude Code Via Litellm CopilotRun Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API.270k72

Codex PetGenerate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro.246k8

How it compares

Pick phoenix-evals for qualitative-to-quantitative failure grouping; use automated eval runners when you need scripted pass-fail scoring without taxonomy design.

FAQ

What is the phoenix-evals axial coding process?

phoenix-evals follows four steps: gather open coding notes, group notes with common themes, create actionable category names, and quantify failures per category. Output is a structured YAML failure taxonomy.

What failure dimensions does phoenix-evals cover?

phoenix-evals example taxonomies include content_quality (hallucination, incompleteness), communication (tone, clarity), and context (user_context, retrieved_context) with nested subcategories for counting.

Is Phoenix Evals safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Buildingagentsllmautomation