Phoenix Cli

Name: Phoenix Cli
Author: arize-ai

arize-ai/phoenix

1.1k installs
10.8k repo stars
Updated July 28, 2026
arize-ai/phoenix

phoenix-cli is an AI agent skill that turns raw Phoenix trace observations and open-coding notes into structured MECE failure taxonomies with counts so developers can prioritize LLM fixes and eval design.

About

phoenix-cli is an Arize Phoenix agent skill for developers debugging instrumented LLM and agent applications who have qualitative failure notes but need quantitative structure. Its axial-coding reference groups open-ended observations—trace notes, span reviews, or open-coding output—into named MECE categories with counts grounded in real traffic rather than invented top-down labels. The workflow pairs with open coding via shared session identifiers and Phoenix CLI commands such as px trace add-note and px span annotate, plus GraphQL API access for session rollups. Reach for phoenix-cli when asking what failure categories exist, which evals to build next, or how to prioritize fixes after traces are already collected. The skill supports trace-, span-, and session-level units of analysis so multi-turn agent trajectory failures roll up correctly before eval design. Recommended flow runs open-coding notes first, then axial grouping, then eval construction for the highest-count failure categories surfaced in Phoenix UI-linked coding sessions.

Groups open-ended observations into named categories grounded in real traces
Reuses the exact coding annotation identifier from open-coding for data continuity
Produces MECE-style breakdowns with category counts for downstream eval and prioritization work
Supports queries such as "what categories of failures do we have" or "what should I build evals for"
Works after open-coding or directly from any set of trace observations

Phoenix Cli by the numbers

1,115 all-time installs (skills.sh)
+85 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #941 of 16,659 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/arize-ai/phoenix --skill phoenix-cli

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/arize-ai/phoenix/phoenix-cli.svg)](https://skillselion.com/skills/arize-ai/phoenix/phoenix-cli)

Installs	1.1k
repo stars	★ 10.8k
Security audit	2 / 3 scanners passed
Last updated	July 28, 2026
Repository	arize-ai/phoenix ↗

How do you build LLM failure taxonomies from traces?

Turn raw observations, traces, or open-coding notes into structured failure taxonomies with counts that support eval design and fix prioritization.

Who is it for?

ML and agent engineers using Arize Phoenix who have trace observations and need counted failure categories before writing evals.

Skip if: Greenfield apps without Phoenix instrumentation or teams wanting generic bug triage outside LLM trace data.

When should I use this skill?

The user has Phoenix trace notes and asks for failure categories, MECE breakdowns, eval priorities, or axial coding after open coding.

What you get

A MECE failure taxonomy with category counts, annotated Phoenix traces or spans, and prioritized eval targets.

MECE failure taxonomy with counts
Phoenix span or trace annotations
Prioritized eval target list

Files

SKILL.mdMarkdownGitHub ↗

Phoenix CLI

Invocation

px <resource> <action>                          # if installed globally
npx @arizeai/phoenix-cli <resource> <action>    # no install required

The CLI uses singular resource commands with subcommands like list and get:

px trace list
px trace get <trace-id>
px trace annotate <trace-id>
px trace add-note <trace-id>
px trace-annotations delete
px span list
px span annotate <span-id>
px span add-note <span-id>
px span-annotations delete
px session list
px session get <session-id>
px session annotate <session-id>
px session add-note <session-id>
px session-annotations delete
px dataset list
px dataset get <name>
px project list
px project get <name>
px annotation-config list
px auth status
px profile list
px profile show [name]
px profile create <name>
px profile use <name>
px profile edit <name>
px profile delete <name>

Setup

export PHOENIX_HOST=http://localhost:6006
export PHOENIX_PROJECT=my-project
export PHOENIX_API_KEY=your-api-key  # if auth is enabled

Always use --format raw --no-progress when piping to jq.

Quick Reference

Task	Files
Look at sampled traces, spans, or sessions and write specific notes about what went wrong (no taxonomy yet)	references/open-coding
Group those notes into a structured failure taxonomy and quantify what matters	references/axial-coding

Both stages tag every artifact with one shared coding annotation identifier (descriptive shape, e.g. coding-run:chatbot-context-loss-2026-05-06) so the run is queryable, reversible, and viewable as a unit. Pass --identifier <value> explicitly on every px call — shell inheritance is unreliable across agent harnesses. Open coding writes notes via px ... add-note and records a small local JSONL sidecar at .px/coding/<sanitized-identifier>.jsonl; axial coding reads that sidecar as the deterministic handoff and records labels in .px/coding/<sanitized-identifier>-axial.jsonl. Pick the identifier once per run (see references/open-coding.md), then share the Phoenix UI link from the wrap-up section. Revert is opt-in and runs three identifier-bound DELETEs only after explicit user confirmation.

Workflow term vs. server annotation name. The skill prose calls this value the coding annotation identifier (shell-variable hint: CODING_ANNOTATION_IDENTIFIER). The server-side annotation NAME used for the UI filter is unchanged — coding_session_id — for data compatibility with rows already written by previous runs. Don't try to rename the server-side annotation; treat the asymmetry as load-bearing.

Workflows

"What do I do after instrumenting?" / "Where do I focus?" / "What's going wrong?" open-coding → axial-coding → build evals for the top categories.

Reference Categories

Prefix	Description
`references/open-coding`	Free-form notes against sampled traces, spans, or sessions — reach for it whenever the user wants to make sense of LLM traffic but has no failure categories yet. Includes a unit-of-analysis diagnostic so the workflow runs at the level the failure modes actually live at (trace for stateless single-shot calls, session for multi-turn agents, span for mechanical/in-isolation failures).
`references/axial-coding`	Inductive grouping of notes into a MECE taxonomy with counts — reach for it whenever the user has observations and needs categories or eval targets

Auth

px auth status                                # check connection and authentication
px auth status --endpoint http://other:6006   # check a specific endpoint
px auth status --profile staging              # check a named profile's connection

Profiles

Named profiles let you switch between multiple Phoenix instances (local, staging, cloud) without juggling environment variables. Profiles are stored in ~/.px/settings.json (or $XDG_CONFIG_HOME/px/settings.json).

Configuration priority (highest to lowest): CLI flags > env vars > active profile > built-in defaults.

px profile list                              # list all profiles (shows active profile)
px profile show                              # show the active profile's settings
px profile show staging                      # show a named profile's settings
px profile create prod --endpoint https://app.phoenix.arize.com --api-key <key> --activate
px profile create local --endpoint http://localhost:6006 --project my-app
px profile use prod                          # switch the active profile
px profile edit prod                         # open profile JSON in $EDITOR (validates on save)
px profile delete prod --yes                 # delete a profile (--yes skips confirmation)

Use --profile <name> on any command to target a specific profile without changing the active one:

px trace list --profile staging --limit 10 --format raw --no-progress | jq .
px auth status --profile prod

px profile create options: --endpoint <url>, --project <name>, --api-key <key>, --header <key=value> (repeatable), --activate.

Projects

px project list                                            # list all projects (table view)
px project list --format raw --no-progress | jq '.[].name' # project names as JSON
px project get my-project --format raw --no-progress       # single record by exact name
px project get my-project --format raw --no-progress | jq -r '.id'  # extract project id

project get exits with ExitCode.FAILURE (1) on a name miss and writes a StructuredError {error, code: "FAILURE", hint} to stderr in --format json|raw.

Traces

px trace list --limit 20 --format raw --no-progress | jq .
px trace list --last-n-minutes 60 --limit 20 --format raw --no-progress | jq '.[] | select(.status == "ERROR")'
px trace list --since 2025-01-15T00:00:00Z --limit 50 --format raw --no-progress | jq .
px trace list --format raw --no-progress | jq 'sort_by(-.duration) | .[0:5]'
px trace list --include-notes --format raw --no-progress | jq '.[].notes'
px trace get <trace-id> --format raw | jq .
px trace get <trace-id> --format raw | jq '.spans[] | select(.status_code != "OK")'
px trace get <trace-id> --include-notes --format raw | jq '.notes'
px trace annotate <trace-id> --name reviewer --label pass
px trace annotate <trace-id> --name reviewer --score 0.9 --format raw --no-progress
px trace annotate <trace-id> --name reviewer --label pass --identifier "<coding-annotation-id>"  # tag with a coding annotation identifier
px trace add-note <trace-id> --text "needs follow-up"
px trace add-note <trace-id> --text "needs follow-up" --identifier "<coding-annotation-id>"  # tag + upsert on identifier
px trace-annotations delete --identifier "<coding-annotation-id>" --all -y            # nuke every annotation tied to this coding annotation identifier

px <entity>-annotations delete requires --all or both --start-time and --end-time and emits {deleted: true, target, filter} on success.

Trace JSON shape

Trace
  traceId, status ("OK"|"ERROR"), duration (ms), startTime, endTime
  annotations[] (with --include-annotations, excludes note)
    name, result { score, label, explanation }
  notes[] (with --include-notes)
    name="note", result { explanation }
  rootSpan  — top-level span (parent_id: null)
  spans[]
    name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT"|"RERANKER"|"GUARDRAIL"|"EVALUATOR"|"UNKNOWN")
    status_code ("OK"|"ERROR"|"UNSET"), parent_id, context.span_id
    notes[] (with --include-notes)
      name="note", result { explanation }
    attributes
      input.value, output.value          — raw input/output
      llm.model_name, llm.provider
      llm.token_count.prompt/completion/total
      llm.token_count.prompt_details.cache_read
      llm.token_count.completion_details.reasoning
      llm.input_messages.{N}.message.role/content
      llm.output_messages.{N}.message.role/content
      llm.invocation_parameters          — JSON string (temperature, etc.)
      exception.message                  — set if span errored

Spans

px span list --limit 20                                    # recent spans (table view)
px span list --last-n-minutes 60 --limit 50                # spans from last hour
px span list --since 2025-01-15T00:00:00Z --limit 50       # spans since a timestamp
px span list --span-kind LLM --limit 10                    # only LLM spans
px span list --status-code ERROR --limit 20                # only errored spans
px span list --name chat_completion --limit 10             # filter by span name
px span list --trace-id <id> --format raw --no-progress | jq .   # all spans for a trace
px span list --parent-id null --limit 10                   # only root spans
px span list --parent-id <span-id> --limit 10              # only children of a span
px span list --include-annotations --limit 10              # include annotation scores
px span list --include-notes --limit 10                    # include span notes
px span list --attribute llm.model_name:gpt-4 --limit 10  # filter by string attribute
px span list --attribute llm.token_count.total:500 --limit 10  # filter by numeric attribute
px span list --attribute 'user.id:"12345"' --limit 10     # force string match for numeric-looking value
px span list --attribute session.id:sess:abc:123 --limit 20  # colon in value OK (split on first colon only)
px span list --attribute llm.model_name:gpt-4 --attribute session.id:abc --limit 10  # AND multiple filters
px span list output.json --limit 100                       # save to JSON file
px span list --format raw --no-progress | jq '.[] | select(.status_code == "ERROR")'
px span annotate <span-id> --name reviewer --label pass
px span annotate <span-id> --name checker --score 1 --annotator-kind CODE
px span annotate <span-id> --name reviewer --label pass --identifier "<coding-annotation-id>"  # tag with a coding annotation identifier
px span add-note <span-id> --text "verified by agent"
px span add-note <span-id> --text "verified by agent" --identifier "<coding-annotation-id>"  # tag + upsert on identifier
px span-annotations delete --identifier "<coding-annotation-id>" --all -y           # nuke every annotation tied to this coding annotation identifier

Span JSON shape

Span
  name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT"|"RERANKER"|"GUARDRAIL"|"EVALUATOR"|"UNKNOWN")
  status_code ("OK"|"ERROR"|"UNSET"), status_message
  context.span_id, context.trace_id, parent_id
  start_time, end_time
  attributes
    input.value, output.value          — raw input/output
    llm.model_name, llm.provider
    llm.token_count.prompt/completion/total
    llm.input_messages.{N}.message.role/content
    llm.output_messages.{N}.message.role/content
    llm.invocation_parameters          — JSON string (temperature, etc.)
    exception.message                  — set if span errored
  annotations[] (with --include-annotations, excludes note)
    name, result { score, label, explanation }
  notes[] (with --include-notes)
    name="note", result { explanation }

Sessions

px session list --limit 10 --format raw --no-progress | jq .
px session list --order asc --format raw --no-progress | jq '.[].session_id'
px session list --include-annotations --include-notes --format raw --no-progress | jq '.[].notes'
px session get <session-id> --format raw | jq .
px session get <session-id> --include-annotations --format raw | jq '.session.annotations'
px session get <session-id> --include-notes --format raw | jq '.session.notes'
px session annotate <session-id> --name reviewer --label pass
px session annotate <session-id> --name reviewer --score 0.9 --format raw --no-progress
px session annotate <session-id> --name reviewer --label pass --identifier "<coding-annotation-id>"  # tag with a coding annotation identifier
px session add-note <session-id> --text "verified by agent"
px session add-note <session-id> --text "verified by agent" --identifier "<coding-annotation-id>"  # tag + upsert on identifier
px session-annotations delete --identifier "<coding-annotation-id>" --all -y              # nuke every annotation tied to this coding annotation identifier

Session JSON shape

SessionData
  id, session_id, project_id
  start_time, end_time
  token_count_prompt, token_count_completion, token_count_total  — cumulative across all LLM spans in the session (int, default 0)
  annotations[] (with --include-annotations, excludes note)
    name, result { score, label, explanation }
  notes[] (with --include-notes)
    name="note", result { explanation }
  traces[]
    id, trace_id, start_time, end_time

Datasets / Experiments / Prompts

px dataset list --format raw --no-progress | jq '.[].name'
px dataset get <name> --format raw | jq '.examples[] | {input, output: .expected_output}'
px dataset get <name> --split train --format raw | jq .    # filter by split
px dataset get <name> --version <version-id> --format raw | jq .
px experiment list --dataset <name> --format raw --no-progress | jq '.[] | {id, name, failed_run_count}'
px experiment get <id> --format raw --no-progress | jq '.[] | select(.error != null) | {input, error}'
px prompt list --format raw --no-progress | jq '.[].name'
px prompt get <name> --format text --no-progress   # plain text, ideal for piping to AI

Annotation Configs

px annotation-config list                                           # list all configs (table view)
px annotation-config list --format raw --no-progress | jq '.[].name' # config names as JSON

GraphQL

For ad-hoc queries not covered by the commands above. Output is {"data": {...}}.

px api graphql '{ projectCount datasetCount promptCount evaluatorCount }'
px api graphql '{ projects { edges { node { name traceCount tokenCountTotal } } } }' | jq '.data.projects.edges[].node'
px api graphql '{ datasets { edges { node { name exampleCount experimentCount } } } }' | jq '.data.datasets.edges[].node'
px api graphql '{ evaluators { edges { node { name kind } } } }' | jq '.data.evaluators.edges[].node'
# evaluator kind values: "LLM" | "CODE" | "BUILTIN"
# CODE = server-side code evaluator running in a sandbox; BUILTIN = pre-built server evaluator

# Introspect any type
px api graphql '{ __type(name: "Project") { fields { name type { name } } } }' | jq '.data.__type.fields[]'

Key root fields: projects, datasets, prompts, evaluators, projectCount, datasetCount, promptCount, evaluatorCount, viewer.

Docs

Download Phoenix documentation markdown for local use by coding agents.

px docs fetch                                # fetch default workflow docs to .px/docs
px docs fetch --workflow tracing             # fetch only tracing docs
px docs fetch --workflow tracing --workflow evaluation
px docs fetch --dry-run                      # preview what would be downloaded
px docs fetch --refresh                      # clear .px/docs and re-download
px docs fetch --output-dir ./my-docs         # custom output directory

Key options: --workflow (repeatable, values: tracing, evaluation, datasets, prompts, integrations, sdk, self-hosting, all), --dry-run, --refresh, --output-dir (default .px/docs), --workers (default 10).

Axial Coding

Group open-ended observations into structured failure taxonomies. Axial coding turns notes, trace observations, or open-coding output into named categories with counts, supporting downstream work like eval design and fix prioritization. It works well after open coding, but can start from any set of open-ended observations.

Reach for this whenever the user has observations and needs structure — e.g., "what categories of failures do we have", "what should I build evals for", "how do I prioritize fixes", "group these notes", "MECE breakdown", or any framing that asks for categories or counts grounded in real traces rather than invented top-down.

Coding annotation identifier (reuse the open-coding value)

Reuse the coding annotation identifier chosen in open coding — every annotate call below passes --identifier "$CODING_ANNOTATION_IDENTIFIER" explicitly. In a fresh shell or fresh agent invocation, set CODING_ANNOTATION_IDENTIFIER to the same value (recoverable from the wrap-up UI URL or by listing .px/coding/*.jsonl); don't mint a new id. See open-coding.md#coding-annotation-identifier-pick-this-first for the rationale and the sanitization rule.

Workflow term vs. server annotation name. The skill calls this value the coding annotation identifier; the server annotation NAME used for the UI filter stays coding_session_id for data compatibility. Don't try to rename the server-side key.

CODING_ANNOTATION_IDENTIFIER="coding-run:chatbot-context-loss-2026-05-06"
SLUG=$(echo -n "$CODING_ANNOTATION_IDENTIFIER" | sed 's/[^a-zA-Z0-9_-]/-/g')
NOTES_SIDECAR=".px/coding/${SLUG}.jsonl"
AXIAL_SIDECAR=".px/coding/${SLUG}-axial.jsonl"

Choosing the unit

Open coding's diagnostic in open-coding.md#choosing-the-unit-of-analysis commits to a unit (trace, span, or session). Axial coding inherits that unit by default — if open coding ran at the session level, axial labels will too; same for trace and span.

An axial label can live at a different level than the note that informed it — that's a feature, and it works in every direction:

Trace → span: a trace-level note "answered shipping when asked about returns" can produce a span-level annotation on the retrieval span once a pattern reveals retrieval as the consistent culprit.
Trace → session: a batch of trace-level notes describing single-turn confusion can produce a session-level annotation once you see the pattern is "the agent doesn't track the user's stated context across turns."
Session → trace: a session-level note about cross-turn drift may, on closer reading, attribute to one specific turn where the agent dropped the thread; a trace-level annotation can name that turn.

Whichever level you write the axial label on, write the matching coding_session_id UI-filter annotation on the same entity (see UI-filter annotation below) so the UI link picks it up.

Process

1. Set the coding annotation identifier — set CODING_ANNOTATION_IDENTIFIER to the value used in open coding and re-derive SLUG, NOTES_SIDECAR, AXIAL_SIDECAR (see Coding annotation identifier) 2. Gather — read open-coding notes from $NOTES_SIDECAR (at the unit committed in open coding); no server round-trip 3. Pattern — group notes with common themes 4. Name — create actionable category names 5. Attribute — decide what level each category lives at; an axial label can move up (trace → session) or down (trace → span) from the source note's level to the level the pattern actually implicates 6. Record — px {trace,span,session} annotate ... --name axial_coding_category --label <cat> --identifier "$CODING_ANNOTATION_IDENTIFIER", add/update one JSONL sidecar row for the label, then write the matching coding_session_id UI-filter annotation 7. Quantify — count failures per category from $AXIAL_SIDECAR

Example Taxonomy

failure_taxonomy:
  content_quality:
    hallucination: [invented_facts, fictional_citations]
    incompleteness: [partial_answer, missing_key_info]
    inaccuracy: [wrong_numbers, wrong_dates]

  communication:
    tone_mismatch: [too_casual, too_formal]
    clarity: [ambiguous, jargon_heavy]

  context:
    user_context: [ignored_preferences, misunderstood_intent]
    retrieved_context: [ignored_documents, wrong_context]

  safety:
    missing_disclaimers: [legal, medical, financial]

Reading

1. Gather — read this run's open-coding notes from the sidecar

Open-coding wrote one JSONL line per note to $NOTES_SIDECAR (.px/coding/${SLUG}.jsonl). Read it directly — no server round-trip is needed. Each line has entity_kind, entity_id, note, identifier, and ts. If the same (entity_kind, entity_id) appears more than once, use the newest ts as the current note.

Missing-file behavior. An absent $NOTES_SIDECAR means open coding hasn't run for this coding annotation identifier in this CWD — stop and run open coding first, do not silently treat it as zero notes.

Malformed lines. Each line is independently parseable JSON. If jq reports a parse error, fix or drop that line manually; do not edit other lines.

Notes outside this run. The sidecar only carries notes this CWD wrote. To pull notes another reviewer or earlier run wrote, fetch them via px {trace,span,session} list --include-notes (embeds notes into row output) — the workflow's sidecar is intentionally per-CWD-per-coding-identifier.

2. Group — synthesize categories

Review the note text collected above. Manually identify recurring themes and draft candidate category names. Aim for MECE coverage: each note should fit exactly one category.

3. Record — write axial-coding labels

Write one annotation per entity using px {trace,span,session} annotate, passing --identifier "$CODING_ANNOTATION_IDENTIFIER" explicitly on every call, and record one JSONL row in $AXIAL_SIDECAR so Quantify below can count without a server round-trip. The level can differ from where the source note lives — see Recording below.

4. Quantify — count per category from the axial sidecar

Counts come from $AXIAL_SIDECAR (populated by Record). No server query, no project-wide history mixed in — the sidecar holds exactly the labels this run wrote. Count the current rows by axial_label; if an entity appears more than once, use the newest ts.

Same missing-file and malformed-line rules as $NOTES_SIDECAR: a missing axial sidecar means no labels have been written yet (run Record); malformed lines are line-local — fix or drop, don't edit neighbors.

Recording

Use the matching annotate command for the level the label belongs at — which may differ from where the source note lives (see Choosing the unit). Every call carries --identifier "$CODING_ANNOTATION_IDENTIFIER" and --format raw --no-progress, and is paired with a JSONL row in $AXIAL_SIDECAR.

Axial sidecar JSONL line shape (one per `annotate`):

{"entity_kind":"trace","entity_id":"<trace-id>","annotation_name":"axial_coding_category","axial_label":"<label>","explanation":"<optional explanation>","identifier":"<original identifier value, unsanitized>","ts":"<ISO-8601 UTC>"}

Fields:

entity_kind — "trace", "span", or "session" (matches the annotate subcommand)
entity_id — the entity argument passed to annotate
annotation_name — always "axial_coding_category" for axial labels (the workflow's reserved annotation name)
axial_label — the --label value, verbatim; this is what Quantify groups on
explanation — optional, but include it when the annotate call used --explanation
identifier — the original $CODING_ANNOTATION_IDENTIFIER value, unsanitized; the sanitized form lives only in the filename
ts — ISO-8601 UTC timestamp of the local append

If you revise a label for the same entity under the same coding annotation identifier, either replace that row or append a newer row. When duplicate (entity_kind, entity_id, annotation_name) rows exist, the newest ts is the current label. This matches the server upsert behavior of annotate --identifier.

Minimal trace example:

px trace annotate <trace-id> \
  --name axial_coding_category \
  --label answered_off_topic \
  --explanation "asked about returns; answer covered shipping" \
  --annotator-kind HUMAN \
  --identifier "$CODING_ANNOTATION_IDENTIFIER" \
  --format raw --no-progress

Then add a matching JSONL row to $AXIAL_SIDECAR using the line shape above. For span or session labels, change entity_kind, entity_id, and the px subcommand accordingly.

Accepted flags: --name, --label, --score, --explanation, --annotator-kind (HUMAN, LLM, CODE), --identifier. There is no --sync flag — the CLI passes sync=true itself.

UI-filter annotation

Write a coding_session_id annotation at the same level as the axial label — see open-coding.md#ui-filter-annotation for why the Phoenix UI filter requires a name-based annotation rather than the bare --identifier. If open coding already wrote coding_session_id on the same entity, this call upserts (idempotent). The annotation NAME coding_session_id is unchanged; only the workflow's spoken term is "coding annotation identifier".

# Same level as the axial label above
px trace annotate <trace-id> \
  --name coding_session_id \
  --label "$CODING_ANNOTATION_IDENTIFIER" \
  --identifier "$CODING_ANNOTATION_IDENTIFIER"
# or px span annotate / px session annotate at matching levels

Recording discipline

Axial coding categorizes the entities you took notes on during open coding. Use $NOTES_SIDECAR as the source of candidate entities and write labels only after reading the note text and surrounding trace/span/session context. Do not filter by --status-code ERROR — that captures only spans where Python raised, which excludes most failure modes (hallucination, wrong tone, retrieval miss). See open-coding.md for the full reasoning.

Fallback paths: REST POST /v1/{trace,span,session}_annotations and @arizeai/phoenix-client's addSpanAnnotation / addSessionAnnotation (no addTraceAnnotation is exported today — use REST or px trace annotate). The GraphQL endpoint rejects mutations.

Wrapping up

After axial coding finishes, share the Phoenix UI link with the user. The link points to the project's traces table filtered by the coding_session_id annotation — annotations['coding_session_id'].label == '<coding-annotation-id>'. The UI route /projects/:projectId expects an encoded GraphQL node ID, not a project name — resolve it via px project get:

project_id=$(px project get "$PHOENIX_PROJECT" --format raw --no-progress | jq -r '.id')
encoded=$(python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1]))' \
  "annotations['coding_session_id'].label == '$CODING_ANNOTATION_IDENTIFIER'")
echo "Phoenix UI: $PHOENIX_HOST/projects/$project_id/traces?filterCondition=$encoded"

If the user wants to discard everything this run produced (open-coding notes, axial-coding labels, and coding_session_id annotations on the server, plus the local sidecars), three identifier-bound deletes handle the server side and one rm handles the local sidecars. Confirm before running — destructive. Each px <entity>-annotations delete call requires --all to authorize the unbounded sweep; --identifier only narrows. Set PHOENIX_CLI_DANGEROUSLY_ENABLE_DELETES=true first if not already exported:

for kind in trace span session; do
  px "$kind-annotations" delete \
    --identifier "$CODING_ANNOTATION_IDENTIFIER" \
    --all -y \
    --format raw --no-progress
done
rm -f "$NOTES_SIDECAR" "$AXIAL_SIDECAR"

Each px <entity>-annotations delete call removes notes, axial-coding labels, and coding_session_id annotations together because they share the underlying annotation table; the rm clears the local sidecars.

Agent Failure Taxonomy

agent_failures:
  planning: [wrong_plan, incomplete_plan]
  tool_selection: [wrong_tool, missed_tool, unnecessary_call]
  tool_execution: [wrong_parameters, type_error]
  state_management: [lost_context, stuck_in_loop]
  error_recovery: [no_fallback, wrong_fallback]

Transition Matrix — jq sketch

To find where failures occur between agent states, identify the last non-error span before each first-error span within a trace. Note: OTel leaves most spans at status_code == "UNSET" and only sets "OK" when code explicitly does so — match != "ERROR" rather than == "OK" so the matrix works on typical OTel data.

px span list --format raw --no-progress | jq '
  group_by(.context.trace_id)
  | map(
      sort_by(.start_time)
      | { trace_id: .[0].context.trace_id,
          last_non_error: map(select(.status_code != "ERROR")) | last | .name,
          first_err:      map(select(.status_code == "ERROR")) | first | .name }
    )
  | [ .[] | select(.first_err != null) ]
  | group_by([.last_non_error, .first_err])
  | map({ transition: "\(.[0].last_non_error) → \(.[0].first_err)", count: length })
  | sort_by(-.count)
'

Use the output to tally which state-to-state transitions are most failure-prone and add them to your taxonomy.

What Makes a Good Category

A useful category is:

Named for the cause, not the symptom ("wrong_tool_selected", not "bad_output")
Tied to a fix — if you can't name a remediation, the category is too vague
Grounded in data — emerged from actual note text, not assumed upfront

Principles

One coding annotation identifier per run — every annotate call and every sidecar line carries $CODING_ANNOTATION_IDENTIFIER, the same value open coding used; never mint a new id mid-run.
Pass `--identifier` explicitly — every px call gets --identifier "$CODING_ANNOTATION_IDENTIFIER"; do not rely on inherited env vars.
Sidecar reads, server writes — Gather and Quantify read $NOTES_SIDECAR and $AXIAL_SIDECAR locally; Record writes to the server and updates the sidecar. If an entity appears more than once, the newest ts wins.
MECE — Each failure fits ONE category.
Actionable — Categories suggest fixes.
Bottom-up — Let categories emerge from data.
UI-filter annotation always paired — never write axial_coding_category without writing the matching coding_session_id annotation; the UI link depends on it.

Open Coding

Free-form note-writing against sampled traces, spans, or sessions, before any taxonomy exists. After you pick a sample at the right unit (see Choosing the unit of analysis), read each one and write a short, specific observation of what went wrong. These raw notes feed axial coding, where they get grouped into named failure categories — and ultimately into eval targets or fix priorities.

Reach for this whenever the user wants to look at LLM traffic without a fixed taxonomy yet — e.g., "what's going wrong with this agent", "I just instrumented my app, where do I start", "review these traces", "the chatbot keeps losing context", "what kinds of mistakes is the model making", "help me make sense of these conversations", or any framing that needs grounded observations before categories.

Choosing the unit of analysis

The right unit — trace, span, or session — depends on the question and the system. Pick deliberately before recording; the choice determines whether you call px trace, px span, or px session throughout, and a wrong default is expensive to undo mid-run.

The unit is about where the failure modes you're investigating actually live:

Trace — one input → one call graph → one output. Right for classifiers, single-shot summarizers, stateless tool-using agents, single-query RAG. Failure modes that live here: wrong answer, malformed output, missed retrieval, bad tool selection within one request.
Span — one operation inside a trace. Right for in-isolation mechanical failures (an exception fired, a tool returned an error response, an output is malformed) or when you can attribute on sight to a specific component. Reach for span when the trace as a whole is fine but one piece inside it is the unit of interest.
Session — a sequence of traces sharing a session.id. Right for multi-turn conversational agents, agents with episodic memory, anything where the failure mode is a trajectory: context loss across turns, drift from the user's stated goal, the agent forgetting a stated preference, repeated user clarifications. These failures don't exist on any single trace; they only exist across traces.

Diagnostic — three signals to read

1. User framing. Tilts session: "conversation", "agent forgot", "drift", "memory", "across turns", "user had to repeat themselves". Tilts trace: "this trace", "this call", "the response was wrong", "wrong output". Tilts span: "exception", "error response", "malformed", "the retrieval failed".

2. Data shape. Probe before the loop. The session id lives at rootSpan.attributes["session.id"] (it is not a top-level field on the trace JSON), and is "" for traces that aren't session-wired — filter both:

   px trace list --limit 200 --format raw --no-progress \
     | jq '
       [ .[] | .rootSpan.attributes["session.id"] // empty | select(. != "") ]
       | { with_session: length,
           distinct_sessions: (group_by(.) | length),
           median_traces_per_session:
             (group_by(.) | map(length) | sort | .[length/2|floor] // 0) }
     '

with_session: 0 → sessions not wired; trace is the grain. median_traces_per_session: 1 → single-trace sessions; still trace. median_traces_per_session: 5+ → sessions are meaningful; session is plausibly right.

3. System type. Open one recent trace and inspect the root span's input. A single user message → one turn or one shot. A message array ([{role: user}, {role: assistant}, ...]) → that's a turn within a longer dialogue; the dialogue lives at the session level.

   px trace get <trace-id> --format raw \
     | jq '.rootSpan.attributes["input.value"] | (try fromjson catch .) | (type, length?)'

Commit out loud, then proceed

State the unit explicitly before recording any note:

"Question: 'the chatbot keeps losing context'. Data: median 7 traces per session, message-array inputs. Recording at the session level; will drop to trace for single-turn observations, span for mechanical failures."

The unit can shift if data demands it — a trace-level investigation that surfaces "the agent never remembers earlier turns" should pivot to session. Record the observation, then refocus the next batch. The unit is a starting hypothesis, not a contract.

Coding annotation identifier (pick this first)

Every artifact this workflow produces — open-coding notes, axial-coding labels, the local sidecar files, and the UI-filter annotation — is tagged with one coding annotation identifier so the run is queryable, revertible, and viewable as a unit. Pick a descriptive, unique identifier before recording any notes. Format suggestion:

coding-run:<short-topic>-<YYYY-MM-DD>

Examples: coding-run:chatbot-context-loss-2026-05-06, coding-run:agent-tool-misuse-q2. Descriptive ids carry meaning for whoever opens the data later — better than an opaque uuid. The coding-run: prefix is a visual convention; the value is the workflow's coding annotation identifier, not a px session id.

Workflow term vs. server annotation name. The skill calls this value the coding annotation identifier. The server-side annotation NAME used for the UI filter is unchanged — coding_session_id — for data compatibility with rows already written. Don't try to rename it.

Pass the identifier explicitly on every px call. A shell variable for readability is fine, but do not rely on shell inheritance — many agent harnesses spawn each command in a fresh subshell, so CODING_ANNOTATION_IDENTIFIER may not propagate.

CODING_ANNOTATION_IDENTIFIER="coding-run:chatbot-context-loss-2026-05-06"

The local sidecar lives at .px/coding/<sanitized-identifier>.jsonl (CWD-relative, matching the .px/docs precedent). Sanitization rule: replace any character not matching [a-zA-Z0-9_-] with - before using the value in the filename — colons, slashes, and other shell-fragile characters get normalized. For CODING_ANNOTATION_IDENTIFIER="coding-run:chatbot-context-loss-2026-05-06" the sidecar path is .px/coding/coding-run-chatbot-context-loss-2026-05-06.jsonl.

Verify this run hasn't already started — uniqueness is a local file check, not a server query:

SLUG=$(echo -n "$CODING_ANNOTATION_IDENTIFIER" | sed 's/[^a-zA-Z0-9_-]/-/g')
SIDECAR=".px/coding/${SLUG}.jsonl"
test ! -f "$SIDECAR" || { echo "Sidecar already exists at $SIDECAR — pick a new identifier or delete the file"; exit 1; }
mkdir -p .px/coding

If $SIDECAR already exists, append a disambiguator (-v2, -dustin, etc.) to CODING_ANNOTATION_IDENTIFIER, re-derive SLUG, and re-check. The agent harness can run open coding and axial coding in independent invocations: each step re-derives SLUG from CODING_ANNOTATION_IDENTIFIER and reads/writes the same file.

Process

1. Pick a coding annotation identifier — choose a descriptive value and verify the sidecar file does not yet exist (see Coding annotation identifier) 2. Pick the unit — work through Choosing the unit of analysis and commit to trace, span, or session 3. Inspect — fetch one entity at the chosen unit (trace / span / session) 4. Read — input, output, exceptions, tool calls, retrieved context, and (at session level) the trajectory across child traces 5. Note — write one specific sentence describing what went wrong (or skip if correct) 6. Record — px {trace,span,session} add-note <id> --text "..." --identifier "$CODING_ANNOTATION_IDENTIFIER" --format raw --no-progress, add/update one JSONL sidecar row for the note, then write the matching UI-filter annotation 7. Iterate — move to the next entity; repeat until the sample is exhausted or saturation hits 8. Hand off — axial coding reads the sidecar directly (no shared shell required); see Wrapping up for the UI link

Inspection

Use px to read context at the unit committed in Choosing the unit:

Trace unit — read one trace's input → tool calls → retrieved context → output as one story.
Span unit — read one operation's input/output and surrounding spans for context.
Session unit — read the sequence of traces in order; the trajectory (turns, retrievals, tool-call patterns across traces) is the data, not any single trace's inputs and outputs.

Don't filter the sample by `--status-code ERROR`. OTel's status_code only flips to ERROR when an instrumentor catches a raised Python exception (network failure, 5xx, parse error). Hallucinations, wrong tone, retrieval misses, and bad tool selection all complete cleanly and arrive as OK or UNSET. Sampling for open coding by --status-code ERROR excludes the population this workflow exists to surface.

# Sample recent traces — the unit of inspection in open coding
px trace list --limit 100 --format raw --no-progress | jq '
  .[] | {trace_id: .traceId, root: .rootSpan.name, status,
         input: .rootSpan.attributes["input.value"],
         output: .rootSpan.attributes["output.value"]}
'

# Trace-level context — all spans in one trace, ordered by start_time
px trace get <trace-id> --format raw | jq '
  .spans | sort_by(.start_time) | map({span_id: .context.span_id, name, status_code,
    input: .attributes["input.value"],
    output: .attributes["output.value"]})
'

# Drill to one span (px span get does not exist; filter via span list)
px span list --trace-id <trace-id> --format raw --no-progress \
  | jq '.[] | select(.context.span_id == "<span-id>")'

# Check existing notes on traces (default) or spans you are about to review
# Notes are stored as annotations with name="note"; use --include-notes (not --include-annotations)
px trace list --include-notes --limit 10 --format raw --no-progress | jq '
  .[] | select((.notes // []) | length > 0)
  | {trace_id: .traceId, notes: [.notes[] | .result.explanation]}
'
# Same shape on spans — swap px trace for px span and use .context.span_id

Always pipe through jq with --format raw --no-progress when scripting.

Recording Notes

Use the add-note command matching the unit committed in Choosing the unit: px trace add-note, px span add-note, or px session add-note. Every call carries an explicit --identifier "$CODING_ANNOTATION_IDENTIFIER" and --format raw --no-progress.

Passing --identifier "$CODING_ANNOTATION_IDENTIFIER" does two things:

Tags the note row with the coding annotation identifier on the server, so the cleanup px <entity>-annotations delete --identifier "$CODING_ANNOTATION_IDENTIFIER" --all sweep removes every artifact this run produced.
Makes the call upsert on (entity_id, name='note', identifier) — re-running open coding on the same entity within the same coding annotation identifier overwrites the prior note instead of appending a second row. (Without --identifier, the server stamps a unique px-{kind}-note:<uuid> and each call appends.)

After every successful add-note, record one JSONL line in $SIDECAR. The sidecar is what axial coding reads — no server round-trip. It is a content handoff, not code: keep it readable, inspect it directly, and use whatever simple tooling is convenient.

Sidecar JSONL line shape (one per `add-note`):

{"entity_kind":"trace","entity_id":"<trace-id>","note":"<text>","identifier":"<original identifier value, unsanitized>","ts":"<ISO-8601 UTC>"}

Fields:

entity_kind — "trace", "span", or "session" (matches the add-note subcommand used)
entity_id — the entity argument passed to add-note (trace id, span id, or session id)
note — the --text value, verbatim
identifier — the original $CODING_ANNOTATION_IDENTIFIER value, unsanitized; the sanitized form lives only in the filename
ts — ISO-8601 UTC timestamp (e.g. 2026-05-08T17:14:09Z) of the local append

If you revise a note for the same entity under the same coding annotation identifier, either replace that row or append a newer row. When duplicate (entity_kind, entity_id) rows exist, the newest ts is the current note. This matches the server upsert behavior of add-note --identifier.

Minimal trace example:

px trace add-note <trace-id> \
  --text "Asked about returns; final answer covered shipping policy instead" \
  --identifier "$CODING_ANNOTATION_IDENTIFIER" \
  --format raw --no-progress

Then add a matching JSONL row to $SIDECAR using the line shape above. For span or session notes, change entity_kind, entity_id, and the px subcommand accordingly.

Bulk auto-tagging by status code (e.g. px span list --status-code ERROR | xargs ... add-note "error") is not open coding — open coding is manual, observation-grounded, and ranges over all failure modes, not just spans where Python raised. Skip the bulk-by-status-code shortcut; it produces fewer, less informative notes than walking traces.

UI-filter annotation

Every entity that receives an open-coding note (or an axial-coding label later) also needs a UI-filter annotation so the Phoenix UI can filter by coding annotation identifier. Phoenix's UI filter language is name-based, not identifier-based — there is no UI primitive for filtering by identifier, so an annotation whose name is the constant coding_session_id and whose label is the coding annotation identifier value is what the wrap-up UI link actually filters on.

The annotation NAME coding_session_id is the load-bearing data key on the server and is unchanged in this rewrite. The skill's workflow term is "coding annotation identifier"; the server key stays coding_session_id for compatibility with rows already written.

Run this once per touched entity, alongside the add-note (and again later when axial coding labels a different entity):

px trace annotate <trace-id> \
  --name coding_session_id \
  --label "$CODING_ANNOTATION_IDENTIFIER" \
  --identifier "$CODING_ANNOTATION_IDENTIFIER"
# or px span annotate / px session annotate at matching levels

The annotation's --identifier matches $CODING_ANNOTATION_IDENTIFIER, so the wrap-up DELETE cleans it up in the same call as the notes and the axial-coding labels.

Fallback write paths (one-line asides):

POST /v1/trace_notes and POST /v1/span_notes and POST /v1/session_notes — accept one {data: {trace_id|span_id|session_id, note, identifier}} per request; the optional identifier field upserts on (entity_id, name='note', identifier) when non-empty.
@arizeai/phoenix-client addTraceNote, addSpanNote, and addSessionNote wrap the same endpoints and accept an optional identifier field on the note object.
The GraphQL endpoint rejects mutations with "Only queries are permitted." — write through px {trace,span,session} add-note or the REST endpoints above.

What Makes a Good Note

Weak note	Why it's weak	Good note	Why it's strong
"Wrong answer"	No observable detail	"Said the store closes at 6pm but policy is 9pm"	Quotes observed vs. correct value
"Bad tone"	Vague judgment	"Used first-name greeting for an enterprise support ticket"	Specifies the context mismatch
"Hallucination"	Labels before observing	"Cited a product feature ('auto-renew') that does not exist in the schema"	Describes what was fabricated
"Retrieval issue"	Category, not observation	"Retrieved docs about shipping when the question was about returns"	States what was retrieved vs. needed
"Model confused"	Opaque	"Answered in Spanish when the user wrote in English"	Observable and reproducible

Write what you saw, not the category you think it belongs to — categorization happens in axial coding. Short prefixes like TONE: or FACTUAL: are a personal shorthand, not a repo convention.

Saturation

Stop writing notes when observations stop being new. Signals:

Repeats — the last 10–15 traces produced notes that describe failures you've already seen.
Paraphrase convergence — you catch yourself writing minor variations of earlier notes.
Skips outnumber notes — most recent traces are correct and need no note.

At saturation, move on to axial coding to group what you have. Continuing past saturation adds traces but not insight. You do not need to annotate every trace — annotating correct ones dilutes signal.

Listing what this run produced

The local sidecar is the handoff record for notes written this run. Inspect it directly. Each line is one note record; if the same entity appears more than once, use the newest ts as the current note. Missing-file behavior: an absent sidecar means open coding has not yet started for this coding annotation identifier; treat that as zero notes, not an error. Malformed lines are line-local: fix or drop the bad line without editing neighbors.

Wrapping up

When the run is done, share the Phoenix UI link with the user. The link filters the project's traces page by the coding_session_id annotation written alongside each note. The UI route /projects/:projectId expects an encoded GraphQL node ID, not a project name — resolve it via px project get:

project_id=$(px project get "$PHOENIX_PROJECT" --format raw --no-progress | jq -r '.id')
encoded=$(python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1]))' \
  "annotations['coding_session_id'].label == '$CODING_ANNOTATION_IDENTIFIER'")
echo "Phoenix UI: $PHOENIX_HOST/projects/$project_id/traces?filterCondition=$encoded"

If the user wants to discard everything this run produced, three identifier-bound deletes handle the server side and one rm handles the local sidecars. Confirm with the user before running — this is destructive. Each call requires --all (or both --start-time and --end-time) to authorize the sweep; --identifier filters further but never authorizes on its own. Set PHOENIX_CLI_DANGEROUSLY_ENABLE_DELETES=true first if not already exported:

for kind in trace span session; do
  px "$kind-annotations" delete \
    --identifier "$CODING_ANNOTATION_IDENTIFIER" \
    --all -y \
    --format raw --no-progress
done
rm -f "$SIDECAR" ".px/coding/${SLUG}-axial.jsonl"

Each px <entity>-annotations delete call covers notes, structured annotations, and the coding_session_id annotation in one shot because they share the underlying annotation table.

Principles

One coding annotation identifier per run — every server artifact and every sidecar line carries the same $CODING_ANNOTATION_IDENTIFIER; never mint a per-stage id.
Pass `--identifier` explicitly — every px call gets --identifier "$CODING_ANNOTATION_IDENTIFIER"; do not rely on inherited env vars across harness-spawned subshells.
Sidecar is the handoff record for notes — axial coding reads from the local sidecar, not from the server; if an entity appears more than once, the newest ts wins.
Free-form over structured — do not pre-commit to a taxonomy during open coding; categories emerge in axial coding.
Specific over general — quote or paraphrase the observed failure; vague labels ("bad response") carry no signal.
Context before labeling — inspect input, output, and retrieved context before writing any note.
Iterate before categorizing — work through the full sample first; resist grouping while still collecting.
Skip is valid — a correct span needs no note; annotating everything dilutes signal.
Revert is opt-in — the wrap-up DELETE only runs after explicit user confirmation; the default path prints the UI link and stops.

Related skills

Setup Matt Pocock SkillsScaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la462k185k

Lark Skill MakerQuickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md.379k15.8k

CavemanSlash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents.378k92.5k

Lark AppsConnect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access.375k

Running Claude Code Via Litellm CopilotRun Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API.270k72

Codex PetGenerate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro.246k8

How it compares

Pick this over generic bug triage skills when failures live in Phoenix traces and you need counted MECE categories before writing LLM evals.

FAQ

What does phoenix-cli axial coding produce?

phoenix-cli axial coding turns open-ended Phoenix trace observations into structured MECE failure categories with counts, supporting eval design and fix prioritization grounded in sampled traces.

How does phoenix-cli relate to open coding?

phoenix-cli axial coding follows open coding: developers first add free-form notes with px add-note, then group those notes into taxonomies with px annotate using a shared session identifier.

Which analysis levels does phoenix-cli support?

phoenix-cli supports trace, span, and session units of analysis so failures in stateless calls, mechanical spans, or multi-turn agent trajectories categorize at the correct Phoenix entity level.

Is Phoenix Cli safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Buildingagentsautomationresearch

About

Phoenix Cli by the numbers

Add your badge

How do you build LLM failure taxonomies from traces?

Who is it for?

When should I use this skill?

What you get

Files

Phoenix CLI

Invocation

Setup

Quick Reference

Workflows

Reference Categories

Auth

Profiles

Projects

Traces

Trace JSON shape

Spans

Span JSON shape

Sessions

Session JSON shape

Datasets / Experiments / Prompts

Annotation Configs

GraphQL

Docs

Axial Coding

Coding annotation identifier (reuse the open-coding value)

Choosing the unit

Process

Example Taxonomy

Reading

1. Gather — read this run's open-coding notes from the sidecar

2. Group — synthesize categories

3. Record — write axial-coding labels

4. Quantify — count per category from the axial sidecar

Recording

UI-filter annotation

Recording discipline

Wrapping up

Agent Failure Taxonomy

Transition Matrix — jq sketch

What Makes a Good Category

Principles

Open Coding

Choosing the unit of analysis

Diagnostic — three signals to read

Commit out loud, then proceed

Coding annotation identifier (pick this first)

Process

Inspection

Recording Notes

UI-filter annotation

What Makes a Good Note

Saturation

Listing what this run produced

Wrapping up

Principles

Related skills

How it compares

FAQ

What does phoenix-cli axial coding produce?

How does phoenix-cli relate to open coding?

Which analysis levels does phoenix-cli support?

Is Phoenix Cli safe to install?

This week in AI coding