
Phoenix Evals
Structure agent failure notes into quantified taxonomies and push labels into Arize Phoenix for eval workflows.
Overview
Phoenix Evals is an agent skill, most often used in the Ship phase (also Build and Operate), that groups open coding notes into structured failure taxonomies and records them as Phoenix span annotations.
Install
npx skills add https://github.com/github/awesome-copilot --skill phoenix-evals
What is this skill?
- Four-step axial workflow: gather open notes, pattern themes, name categories, quantify counts per bucket
- YAML failure_taxonomy templates for content quality, communication, context, and safety dimensions
- Human span annotations via Phoenix Python Client and TypeScript @arizeai/phoenix-client
- Agent-oriented failure taxonomy structure in SKILL.md for grounding eval rubrics
- Pairs qualitative code grouping with quantitative per-category failure counts
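The four steps (gather, pattern, name, quantify) can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: the notes, the `KEYWORDS` map, and the `categorize` helper are all hypothetical, and in practice the pattern and name steps are human judgment rather than keyword matching.

```python
from collections import Counter

# 1. Gather: hypothetical open coding notes written during trace review.
notes = [
    "cited a paper that does not exist",
    "answer skipped the refund policy",
    "invented a feature that doesn't exist",
    "ignored the user's stated preference",
]

# 2-3. Pattern and name: a hypothetical keyword map standing in for manual grouping.
KEYWORDS = {
    "hallucination": ["invented", "does not exist"],
    "incompleteness": ["skipped", "missing"],
    "user_context": ["ignored", "preference"],
}


def categorize(note: str) -> str:
    """Return the first category whose keyword appears in the note."""
    for category, keywords in KEYWORDS.items():
        if any(keyword in note for keyword in keywords):
            return category
    return "uncategorized"


# 4. Quantify: count failures per category.
counts = Counter(categorize(note) for note in notes)
print(counts.most_common())
```

Sorting with `most_common()` puts the largest bucket first, which is usually the one to fix first.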
Adoption & trust: 849 installs on skills.sh; 34.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have piles of unstructured agent failure notes but no shared taxonomy or counts to prioritize fixes.
Who is it for?
Indie builders running human review on agent traces who already use or plan to use Arize Phoenix for observability.
Skip if: you want automated pass/fail scoring without human labeling, or your project has no LLM traces to annotate.
When should I use this skill?
You are turning open-ended eval notes into structured failure categories and Phoenix span labels.
What do I get? / Deliverables
You leave with named failure categories, per-category quantities, and annotated spans ready for Phoenix-backed eval dashboards and iteration.
- failure_taxonomy YAML or equivalent
- Span annotations with label and explanation
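One way to use the taxonomy deliverable when labeling is to build a reverse lookup from each leaf code to its category. The sketch below assumes the YAML has already been parsed into a plain dict (an excerpt of the example taxonomy from SKILL.md is inlined); `build_code_index` is a hypothetical helper, not part of the skill or of Phoenix. It also rejects a code that appears under two categories, which is one cheap check on the "each failure fits ONE category" principle.

```python
# Excerpt of the example taxonomy; a plain dict stands in for parsed YAML.
failure_taxonomy = {
    "content_quality": {
        "hallucination": ["invented_facts", "fictional_citations"],
        "incompleteness": ["partial_answer", "missing_key_info"],
    },
    "context": {
        "user_context": ["ignored_preferences", "misunderstood_intent"],
    },
}


def build_code_index(taxonomy):
    """Map each leaf code to its (dimension, category); raise if a code repeats."""
    index = {}
    for dimension, categories in taxonomy.items():
        for category, codes in categories.items():
            for code in codes:
                if code in index:
                    raise ValueError(f"{code!r} appears in more than one category")
                index[code] = (dimension, category)
    return index


code_index = build_code_index(failure_taxonomy)
dimension, category = code_index["invented_facts"]
print(dimension, category)
```

The looked-up category name is what would go into the annotation's label field, with the reviewer's note as the explanation.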
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship/testing because axial coding turns messy eval notes into categories you act on before release. Testing subphase fits qualitative failure labeling and span annotations tied to traces you are reviewing pre-launch.
Where it fits
Cluster pre-release trace reviews into hallucination vs context buckets before sign-off.
Draft a failure_taxonomy YAML while designing a new tool-calling agent.
Re-quantify labeled failures after a prompt change to see which category grew.
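Re-quantifying after a prompt change amounts to comparing two sets of per-category counts. A minimal sketch, assuming the labels from each review round are available as plain lists of category names; the sample data and the `category_deltas` helper are hypothetical.

```python
from collections import Counter

# Hypothetical category labels collected from two review rounds.
before = ["hallucination", "hallucination", "retrieved_context", "tone_mismatch"]
after = ["hallucination", "retrieved_context", "retrieved_context", "retrieved_context"]


def category_deltas(before_labels, after_labels):
    """Return {category: after_count - before_count} for every category seen."""
    before_counts = Counter(before_labels)
    after_counts = Counter(after_labels)
    categories = sorted(set(before_counts) | set(after_counts))
    return {c: after_counts[c] - before_counts[c] for c in categories}


deltas = category_deltas(before, after)
grew = [category for category, delta in deltas.items() if delta > 0]
print(deltas, grew)
```

Raw deltas are only comparable when both rounds review a similar number of traces; otherwise divide each count by the round's total first.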
How it compares
Use for qualitative taxonomy and labeling on traces, not as a drop-in replacement for automated benchmark harnesses.
Common Questions / FAQ
Who is phoenix-evals for?
Solo and indie builders evaluating Claude Code or similar agents who need consistent failure categories and Phoenix-compatible annotations.
When should I use phoenix-evals?
During Ship testing when reviewing spans, after Build agent-tooling changes when defining rubrics, or in Operate when triaging production failure patterns from open notes.
Is phoenix-evals safe to install?
Treat it as documentation and example API calls; review the Security Audits panel on this Prism page before wiring Phoenix credentials or client code into your repo.
SKILL.md
# Axial Coding

Group open-ended notes into structured failure taxonomies.

## Process

1. **Gather** - Collect open coding notes
2. **Pattern** - Group notes with common themes
3. **Name** - Create actionable category names
4. **Quantify** - Count failures per category

## Example Taxonomy

```yaml
failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]
  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]
  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]
  safety:
    missing_disclaimers: [legal, medical, financial]
```

## Add Annotation (Python)

```python
from phoenix.client import Client

client = Client()
client.spans.add_span_annotation(
    span_id="abc123",
    annotation_name="failure_category",
    label="hallucination",
    explanation="invented a feature that doesn't exist",
    annotator_kind="HUMAN",
    sync=True,
)
```

## Add Annotation (TypeScript)

```typescript
import { addSpanAnnotation } from "@arizeai/phoenix-client/spans";

await addSpanAnnotation({
  spanAnnotation: {
    spanId: "abc123",
    name: "failure_category",
    label: "hallucination",
    explanation: "invented a feature that doesn't exist",
    annotatorKind: "HUMAN",
  }
});
```

## Agent Failure Taxonomy

```yaml
agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]
```

## Transition Matrix (Agents)

Shows where failures occur between states:

```python
from collections import defaultdict

import pandas as pd


# find_last_success and find_first_failure are not defined in this excerpt;
# supply helpers that return a state name for a conversation.
def build_transition_matrix(conversations, states):
    matrix = defaultdict(lambda: defaultdict(int))
    for conv in conversations:
        if conv["failed"]:
            last_success = find_last_success(conv)
            first_failure = find_first_failure(conv)
            matrix[last_success][first_failure] += 1
    # Columns are last-success states, rows are first-failure states.
    return pd.DataFrame(matrix).fillna(0)
```

## Principles

- **MECE** - Each failure fits ONE category
- **Actionable** - Categories suggest fixes
- **Bottom-up** - Let categories emerge from data

# Common Mistakes (Python)

Patterns that LLMs frequently generate incorrectly from training data.

## Legacy Model Classes

```python
# WRONG
from phoenix.evals import OpenAIModel, AnthropicModel
model = OpenAIModel(model="gpt-4")

# RIGHT
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o")
```

**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`. The `LLM` class is provider-agnostic and is the current 2.0 API.

## Using run_evals Instead of evaluate_dataframe

```python
# WRONG - legacy 1.0 API
from phoenix.evals import run_evals
results = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)
# Returns list of DataFrames

# RIGHT - current 2.0 API
from phoenix.evals import evaluate_dataframe
results_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])
# Returns single DataFrame with {name}_score dict columns
```

**Why**: `run_evals` is the legacy 1.0 batch function. `evaluate_dataframe` is the current 2.0 function with a different return format.

## Wrong Result Column Names

```python
# WRONG - column doesn't exist
score = results_df["relevance"].mean()

# WRONG - column exists but contains dicts, not numbers
score = results_df["relevance_score"].mean()

# RIGHT - extract numeric score from dict
scores = results_df["relevance_score"].apply(
    lambda x: x.get("score", 0.0) if isinstance(x, dict) else 0.0
)
score = scores.mean()
```

**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts like `{"name": "...", "score": 1.0, "label": "...", "explanation": "..."}`.

## Deprec