
Phoenix Observability
Define custom Phoenix LLM evaluators and multi-criteria scoring so you can regression-test agent outputs before and after shipping.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill phoenix-observability
What is this skill?
- Template-based evaluators with OpenAIModel and llm_classify for accuracy, completeness, and clarity
- Custom SCORE 1–5 rails with normalized float scores and explanation fields
- Multi-criteria evaluator loop that scores each criterion independently
- Python patterns for input, output, and reference-grounded classification
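The SCORE 1–5 rail with a normalized float score can be illustrated without calling any model. The sketch below assumes the judge model replies in the `SCORE: n` / `REASONING: ...` format that the skill's template requests. The helper name `parse_scored_reply` and the choice to return `None` for unparseable replies are illustrative assumptions, not part of Phoenix or the skill.

```python
import re
from typing import Optional, TypedDict


class ParsedScore(TypedDict):
    score: Optional[float]   # normalized to 0.2 .. 1.0, or None if unparseable
    label: Optional[str]     # the raw rail, "1" .. "5"
    explanation: str


# Matches the "SCORE: n" line the evaluation prompt asks the judge to emit.
_SCORE_RE = re.compile(r"SCORE:\s*([1-5])\b")
# Captures everything after "REASONING:" as the explanation.
_REASON_RE = re.compile(r"REASONING:\s*(.*)", re.DOTALL)


def parse_scored_reply(reply: str) -> ParsedScore:
    """Parse a 'SCORE: n / REASONING: ...' reply (hypothetical helper)."""
    score_match = _SCORE_RE.search(reply)
    reason_match = _REASON_RE.search(reply)
    explanation = reason_match.group(1).strip() if reason_match else ""
    if score_match is None:
        # An off-rail reply gets no score rather than a guessed default.
        return {"score": None, "label": None, "explanation": explanation}
    label = score_match.group(1)
    return {"score": int(label) / 5.0, "label": label, "explanation": explanation}
```

Returning `None` for off-rail replies keeps a malformed judgment from being silently counted as a mid-range score.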
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
The canonical shelf is Ship because the SKILL.md centers on evaluator templates, llm_classify rails, and scored quality gates, all of which you run while hardening AI features. The Testing subphase fits because the template-based and multi-criteria evaluators classify outputs against references before release.
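A scored quality gate of the kind described here could look like the sketch below. It assumes per-criterion labels of `good`, `acceptable`, or `poor`, mapped to 1.0, 0.5, and 0.0 as in the skill's multi-criteria example. The function name `passes_quality_gate` and both thresholds are hypothetical choices for illustration.

```python
from statistics import mean
from typing import Dict

# Label-to-score mapping taken from the skill's multi-criteria example.
SCORE_MAP = {"good": 1.0, "acceptable": 0.5, "poor": 0.0}


def passes_quality_gate(
    labels: Dict[str, str],
    min_mean: float = 0.75,   # assumed threshold on the average score
    min_each: float = 0.5,    # assumed floor for every single criterion
) -> bool:
    """Return True when the per-criterion labels clear both thresholds."""
    # Unknown labels count as 0.0 so a malformed judgment cannot pass.
    scores = [SCORE_MAP.get(label, 0.0) for label in labels.values()]
    if not scores:
        return False
    return mean(scores) >= min_mean and min(scores) >= min_each
```

Unlike the skill's example, which defaults unknown labels to 0.5, this sketch treats them as failures. That is the stricter choice for a pre-release check.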
Common Questions / FAQ
Is Phoenix Observability safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Phoenix Advanced Usage Guide

## Custom Evaluators

### Template-Based Evaluators

```python
from phoenix.evals import OpenAIModel, llm_classify

eval_model = OpenAIModel(model="gpt-4o")

# Custom template for specific evaluation
CUSTOM_EVAL_TEMPLATE = """
You are evaluating an AI assistant's response.

User Query: {input}
AI Response: {output}
Reference Answer: {reference}

Evaluate the response on these criteria:
1. Accuracy: Is the information correct?
2. Completeness: Does it fully answer the question?
3. Clarity: Is it easy to understand?

Provide a score from 1-5 and explain your reasoning.
Format: SCORE: [1-5]\nREASONING: [explanation]
"""

def custom_evaluator(input_text, output_text, reference_text):
    # Rails restrict the label to one of the five score strings
    result = llm_classify(
        model=eval_model,
        template=CUSTOM_EVAL_TEMPLATE,
        input=input_text,
        output=output_text,
        reference=reference_text,
        rails=["1", "2", "3", "4", "5"]
    )
    return {
        # Normalize the 1-5 label to a float between 0.2 and 1.0
        "score": float(result.label) / 5.0,
        "label": result.label,
        "explanation": result.explanation
    }
```

### Multi-Criteria Evaluator

```python
from phoenix.evals import OpenAIModel, llm_classify
from dataclasses import dataclass
from typing import List

# Defined here so the block runs on its own (same model as the previous example)
eval_model = OpenAIModel(model="gpt-4o")

@dataclass
class EvaluationResult:
    criteria: str
    score: float
    label: str
    explanation: str

def multi_criteria_evaluator(input_text, output_text, criteria: List[str]):
    """Evaluate output against multiple criteria."""
    results = []

    for criterion in criteria:
        # Doubled braces survive the f-string as {input} and {output} placeholders
        template = f"""
        Evaluate the following response for {criterion}.

        Input: {{input}}
        Output: {{output}}

        Is this response good in terms of {criterion}?
        Answer 'good', 'acceptable', or 'poor'.
        """

        result = llm_classify(
            model=eval_model,
            template=template,
            input=input_text,
            output=output_text,
            rails=["good", "acceptable", "poor"]
        )

        # Labels outside the rails fall back to the midpoint score of 0.5
        score_map = {"good": 1.0, "acceptable": 0.5, "poor": 0.0}
        results.append(EvaluationResult(
            criteria=criterion,
            score=score_map.get(result.label, 0.5),
            label=result.label,
            explanation=result.explanation
        ))

    return results

# Usage
results = multi_criteria_evaluator(
    input_text="What is Python?",
    output_text="Python is a programming language...",
    criteria=["accuracy", "completeness", "helpfulness"]
)
```

### Batch Evaluation with Concurrency

```python
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    RelevanceEvaluator,
    ToxicityEvaluator,
    run_evals,
)
from phoenix import Client

client = Client()
eval_model = OpenAIModel(model="gpt-4o")

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="production",
    filter_condition="span_kind == 'LLM'",
    limit=1000
)

# Run evaluations with concurrency control
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model),
        ToxicityEvaluator(eval_model)
    ],
    provide_explanation=True,
    concurrency=10  # Control parallel evaluations
)

# Log results back to Phoenix
client.log_evaluations(eval_results)
```

## Advanced Experiments

### A/B Testing Prompts

```python
from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

# Define prompt variants
PROMPT_A = """
Answer the following question concisely:
{question}
"""

PROMPT_B = """
You are a helpful assistant. Please provide a detailed answer to:
{question}

Include relevant examples if applicable.
"""

def create_model_with_prompt(prompt_template):
    def model_fn(input_data):
        from openai import OpenAI
        client = OpenAI()

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": prompt_template.format(**input_data)
            }]
        )
        retur