
Phoenix Evals
Structure messy eval notes into quantified failure taxonomies and push labels into Arize Phoenix via span annotations.
Overview
Phoenix Evals is an agent skill, most often used in the Ship phase (and also in Grow analytics and Operate iterate), that groups open eval notes into failure taxonomies and records them on Phoenix spans.
Install
npx skills add https://github.com/arize-ai/phoenix --skill phoenix-evals
What is this skill?
- Four-step axial coding flow: gather open notes, pattern themes, name categories, quantify failures
- Example YAML failure_taxonomy spanning content_quality, communication, context, and safety
- Python Phoenix Client add_span_annotation for human failure_category labels
- TypeScript @arizeai/phoenix-client/spans addSpanAnnotation parity
- Includes agent-oriented failure taxonomy scaffolding for structured eval review
Adoption & trust: 589 installs on skills.sh; 10k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent evals produce scattered qualitative notes with no shared categories or counts, so you cannot see which failure mode dominates.
Who is it for?
Indie builders running Phoenix traces who need human-or-review-driven labels aligned to a consistent failure taxonomy.
Skip if: your team does not use Phoenix or tracing and only needs generic unit tests on deterministic code paths.
When should I use this skill?
When grouping eval notes into structured failure categories and syncing labels to Phoenix spans during LLM or agent quality reviews.
What do I get? / Deliverables
You get a named, quantified failure taxonomy and span annotations in Phoenix so the next prompt, retrieval, or guardrail change targets the top category.
- YAML or structured failure_taxonomy with nested categories
- Span annotations (failure_category labels with explanations) in Phoenix
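One way to put the nested taxonomy to work is to flatten it into a lookup from each leaf label to its group and category, so labels can be checked before they are written as annotations. The sketch below is illustrative only: it assumes the taxonomy has already been loaded into a Python dict (an excerpt of the skill's example is copied inline), and `flatten_taxonomy` is a hypothetical helper, not part of Phoenix.

```python
from typing import Dict, List, Tuple

# Excerpt of the skill's example failure_taxonomy, written as a plain dict.
FAILURE_TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "content_quality": {
        "hallucination": ["invented_facts", "fictional_citations"],
        "incompleteness": ["partial_answer", "missing_key_info"],
    },
    "safety": {
        "missing_disclaimers": ["legal", "medical", "financial"],
    },
}


def flatten_taxonomy(
    taxonomy: Dict[str, Dict[str, List[str]]]
) -> Dict[str, Tuple[str, str]]:
    """Map each leaf label to its (group, category) pair.

    Raises ValueError if a leaf appears twice, since each failure
    is meant to fit exactly one category.
    """
    lookup: Dict[str, Tuple[str, str]] = {}
    for group, categories in taxonomy.items():
        for category, leaves in categories.items():
            for leaf in leaves:
                if leaf in lookup:
                    raise ValueError(f"duplicate leaf label: {leaf}")
                lookup[leaf] = (group, category)
    return lookup


LEAF_LOOKUP = flatten_taxonomy(FAILURE_TAXONOMY)
```

Rejecting duplicate leaves is a design choice made here to keep the categories mutually exclusive; a looser taxonomy could instead collect multiple parents per leaf.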
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Primary shelf is Ship/testing because axial coding and span labels support eval loops and regression triage before and after release. Testing subphase covers LLM/agent output evaluation, taxonomy design, and trace-linked annotations—not production deploy alone.
Where it fits
Cluster backlog review notes from 50 agent runs into hallucination vs incompleteness before a release candidate.
Count failures per taxonomy bucket weekly to decide whether retrieval or tone prompts need investment.
Label production spans with failure_category after a support spike to prioritize guardrail fixes.
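The weekly count described above could be sketched with the standard library alone. This is a minimal illustration, assuming labeled notes are available as (date, category) pairs; the sample data and the `weekly_failure_counts` helper are hypothetical and do not come from the skill.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

# Hypothetical labeled notes: (review date, failure category).
LABELED_NOTES = [
    (date(2024, 1, 1), "hallucination"),
    (date(2024, 1, 2), "hallucination"),
    (date(2024, 1, 3), "tone_mismatch"),
    (date(2024, 1, 8), "incompleteness"),
    (date(2024, 1, 9), "hallucination"),
]


def weekly_failure_counts(
    notes: Iterable[Tuple[date, str]]
) -> Dict[Tuple[int, int], Counter]:
    """Count failures per category, keyed by (ISO year, ISO week)."""
    counts: Dict[Tuple[int, int], Counter] = {}
    for day, category in notes:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        counts.setdefault(week, Counter())[category] += 1
    return counts


WEEKLY = weekly_failure_counts(LABELED_NOTES)
for week, counter in sorted(WEEKLY.items()):
    # most_common(1) gives one dominant category for that week.
    top_category, top_count = counter.most_common(1)[0]
    print(week, dict(counter), "top:", top_category, top_count)
```

Keying by ISO week keeps buckets stable across month boundaries; a team that reviews on a different cadence would swap in its own bucketing.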
How it compares
Complements automated graders—this skill is for structured human axial coding and trace labels, not a one-click pass/fail benchmark runner by itself.
Common Questions / FAQ
Who is phoenix-evals for?
Developers shipping LLM or agent features who use Arize Phoenix and want reproducible failure categories tied to spans.
When should I use phoenix-evals?
During Ship testing after eval sessions; in Grow when analyzing quality trends; in Operate iterate when triaging production-like trace failures—especially before retraining prompts or RAG.
Is phoenix-evals safe to install?
Check this page’s Security Audits panel; annotation calls use your Phoenix client credentials and should use least-privilege API keys in CI and local env.
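As a rough sketch of keeping credentials out of source code, the key can be read from the environment and the script can fail early when it is missing. The variable name `PHOENIX_API_KEY` is an assumption here, and `phoenix_api_key_from_env` is a hypothetical helper; check your Phoenix deployment's documentation for the names it actually reads.

```python
import os
from typing import Mapping, Optional

# Assumed environment variable name; adjust to match your deployment.
API_KEY_ENV_VAR = "PHOENIX_API_KEY"


def phoenix_api_key_from_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{API_KEY_ENV_VAR} is not set; provide a scoped key via the "
            "CI secret store or a local environment file"
        )
    return key
```

Passing the mapping as a parameter is only for testability; in normal use the function would be called with no arguments.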
SKILL.md - Phoenix Evals
# Axial Coding

Group open-ended notes into structured failure taxonomies.

## Process

1. **Gather** - Collect open coding notes
2. **Pattern** - Group notes with common themes
3. **Name** - Create actionable category names
4. **Quantify** - Count failures per category

## Example Taxonomy

```yaml
failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]
  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]
  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]
  safety:
    missing_disclaimers: [legal, medical, financial]
```

## Add Annotation (Python)

```python
from phoenix.client import Client

client = Client()
client.spans.add_span_annotation(
    span_id="abc123",
    annotation_name="failure_category",
    label="hallucination",
    explanation="invented a feature that doesn't exist",
    annotator_kind="HUMAN",
    sync=True,
)
```

## Add Annotation (TypeScript)

```typescript
import { addSpanAnnotation } from "@arizeai/phoenix-client/spans";

await addSpanAnnotation({
  spanAnnotation: {
    spanId: "abc123",
    name: "failure_category",
    label: "hallucination",
    explanation: "invented a feature that doesn't exist",
    annotatorKind: "HUMAN",
  }
});
```

## Agent Failure Taxonomy

```yaml
agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]
```

## Transition Matrix (Agents)

Shows where failures occur between states:

```python
from collections import defaultdict

import pandas as pd


def build_transition_matrix(conversations, states):
    # find_last_success and find_first_failure are helpers that are not
    # shown here; they are assumed to return a state name for a conversation.
    matrix = defaultdict(lambda: defaultdict(int))
    for conv in conversations:
        if conv["failed"]:
            last_success = find_last_success(conv)
            first_failure = find_first_failure(conv)
            matrix[last_success][first_failure] += 1
    # Outer keys (last success) become columns; inner keys become rows.
    return pd.DataFrame(matrix).fillna(0)
```

## Principles

- **MECE** - Each failure fits ONE category
- **Actionable** - Categories suggest fixes
- **Bottom-up** - Let categories emerge from data

# Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

## Legacy Model Classes

```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```

**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`. The `LLM` class is provider-agnostic and is the current 2.0 API.

## Using run_evals Instead of evaluate_dataframe

```python
# WRONG - legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames

# RIGHT - current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```

**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current 2.0 function with a different return format.

## Wrong Result Column Names

```python
# WRONG - column doesn't exist
score = results_df["relevance"].mean()

# WRONG - column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT - extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```

**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.

## Deprec