
Eval Result Interpreter
Turn Copilot Studio eval CSVs or summaries into a SHIP, ITERATE, or BLOCK verdict with prioritized fixes.
Install
npx skills add https://github.com/microsoft/eval-guide --skill eval-result-interpreter
What is this skill?
- SHIP / ITERATE / BLOCK verdict using Microsoft’s Triage & Improvement Playbook 4-layer system
- Root-cause classification and diagnostic triage on Copilot Studio evaluation CSV or pasted summaries
- Prioritized remediation and pattern analysis for failed or flaky agent behaviors
- Maps to MS Learn eval Stages 2–4: baseline iteration, systematic expansion, and post-update regression
- Closes the eval loop: plan → generate → run → interpret
Adoption & trust: 35 installs on skills.sh; 12 GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Interpretation comes after you run agent evaluations, so its natural home in Prism is Ship, where you decide whether quality is launch-ready. The Testing subphase covers eval runs, pass/fail triage, and regression interpretation from Microsoft’s 4-stage eval framework.
Common Questions / FAQ
Is Eval Result Interpreter safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md - Eval Result Interpreter
## Purpose

This skill takes eval results — a Copilot Studio evaluation CSV file, a pasted summary, or plain-English description of results — and produces a structured triage report. It is the final step in the eval lifecycle: plan → generate → run → **interpret**. The output tells you whether to ship, what broke, why it broke, and what to fix first.

This skill serves **Stages 2-4** of the [MS Learn 4-stage evaluation framework](https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-checklist). In Stage 2 (Set Baseline & Iterate), it interprets your first eval results and guides fixes. In Stage 3 (Systematic Expansion), it identifies coverage gaps worth expanding into. In Stage 4 (Operationalize), it triages regression failures after agent updates. Use the [evaluation checklist template](https://github.com/microsoft/PowerPnPGuidanceHub/tree/main/guidance/agentevalguidancekit) to track which stage you are in and what to interpret next.

**Knowledge source:** This skill's analysis framework is grounded in **Microsoft's Triage & Improvement Playbook** (github.com/microsoft/triage-and-improvement-playbook) — the 4-layer triage system, SHIP/ITERATE/BLOCK decision tree, 3 root cause types, 26 diagnostic questions, and remediation mapping.

### When to use this skill vs. eval-triage-and-improvement

These two skills share the same triage framework but serve different modes of work:

| Use **eval-result-interpreter** when… | Use **eval-triage-and-improvement** when… |
|---|---|
| You have a CSV file or concrete results and want a **one-shot structured report** | You want **interactive guidance** walking through diagnosis step by step |
| This is your **first look** at results — you need a verdict and top actions fast | You are in an **ongoing improvement loop** — fixing, re-running, and re-triaging |
| You want a **customer-deliverable artifact** (the .docx triage report) | You need **detailed remediation help** for specific quality signals (e.g., "wrong tool fires — now what?") |
| The eval run is relatively straightforward (<20 failures) | You have **many failures** (15+) and need help prioritizing which to investigate |
| You need the **activity map / result comparison** tool recommendations inline | You need the playbook worked examples and deeper diagnostic walkthroughs |

**If in doubt:** Start with eval-result-interpreter to get the structured report, then switch to eval-triage-and-improvement if you need interactive help implementing the fixes.

## Instructions

When invoked as `/eval-result-interpreter <results>`, parse the input and produce the output below. Accept any of these input formats:

**Format 1 — Copilot Studio CSV file** (primary)

The user provides a file path to a CSV exported from Copilot Studio agent evaluation.

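
As an illustration only, the tallying step for a Format 1 export could look like the Python sketch below. It assumes the numbered column convention this skill expects (`testMethodType_1`, `result_1`, then `_2`, and so on). The function name `tally_results` and the sample rows are hypothetical; they are not part of Copilot Studio or of this skill.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Matches result_1, result_2, ... and captures the numeric suffix.
RESULT_COLUMN = re.compile(r"^result_(\d+)$")


def tally_results(csv_text):
    """Count Pass/Fail overall and per test method in an eval export."""
    totals = Counter()
    per_method = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            match = RESULT_COLUMN.match(column or "")
            if not match or not (value or "").strip():
                continue  # not a result column, or this method slot is unused
            # Pair result_N with testMethodType_N from the same row.
            method = (row.get(f"testMethodType_{match.group(1)}") or "Unknown").strip()
            outcome = value.strip().capitalize()  # normalise "pass"/"PASS" to "Pass"
            totals[outcome] += 1
            per_method[method][outcome] += 1
    return dict(totals), {m: dict(c) for m, c in per_method.items()}


# Made-up sample rows, for illustration only.
SAMPLE = """question,expectedResponse,actualResponse,testMethodType_1,result_1,passingScore_1,explanation_1,testMethodType_2,result_2,passingScore_2,explanation_2
What is the refund window?,30 days,Refunds are accepted within 30 days.,CompareMeaning,Pass,0.7,Same meaning,KeywordMatch,Pass,,Keyword found
How do I reset my password?,,I cannot help with that.,GeneralQuality,Fail,,Response is not helpful,,,,
"""

totals, per_method = tally_results(SAMPLE)
print(totals)      # {'Pass': 2, 'Fail': 1}
print(per_method)
```

The sketch counts each test method on a row separately, so a row with two methods contributes two results. Whether a row should count as failed when only one of its methods fails is a reporting choice the sketch does not make.
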

The CSV has these columns:

| Column | Description |
|---|---|
| `question` | The test case input sent to the agent |
| `expectedResponse` | The expected answer (may be empty for General Quality tests) |
| `actualResponse` | The agent's full response |
| `testMethodType_1` | The test method used (e.g., GeneralQuality, CompareMeaning, KeywordMatch, ToolUse, ExactMatch, Custom) |
| `result_1` | Pass or Fail |
| `passingScore_1` | The threshold score (may be empty) |
| `explanation_1` | The grader's reasoning for the verdict |

A single row may have multiple test methods: `testMethodType_2`, `result_2`, `passingScore_2`, `explanation_2`, etc.

When the user provides a file path, read the CSV and parse it. Count Pass/Fail totals overall and per test method.

**Format 2 — Plain-text summary**

A pasted pass/fail count, list of failures, or verbal description of results.

**Format 3 — Scenario plan reference**