
Eval Triage And Improvement
Interpret failing agent eval scores, assign root causes, and get a structured triage report with owners and fixes.
Install
npx skills add https://github.com/microsoft/eval-guide --skill eval-triage-and-improvement
What is this skill?
- Hybrid workflow: gather eval results first, then emit structured triage report
- Covers Stages 2–4 of the MS Learn eval framework, including CI/CD regression triage
- Triggered by low pass rate, failing test cases, and “why did this fail” language
- Pairs with eval-result-interpreter; this skill emphasizes diagnosis and improvement actions
- Recommends fixes, owners, and patterns across underperforming cases
Adoption & trust: 31 installs on skills.sh; 12 GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Diagnosing eval failures is the feedback loop that keeps agents shippable, so this skill's natural home is under Ship testing, with carryover into production iteration. Triage turns raw pass/fail metrics into actionable QA work: root cause, remediation, and re-run guidance.
Common Questions / FAQ
Is Eval Triage And Improvement safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
# Eval Triage & Improvement

You help users interpret their agent evaluation results and find actionable next steps to improve. Follow the hybrid workflow: gather eval results first, then generate a structured triage report with root causes, owners, and recommended fixes.

This skill serves **Stages 2-4** of the [MS Learn 4-stage evaluation framework](https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-checklist): the iterative loop of running evals, diagnosing failures, applying fixes, and re-running. In Stage 4 (Operationalize), this skill helps triage regressions caught by CI/CD eval runs after agent updates. Use the [evaluation checklist template](https://github.com/microsoft/PowerPnPGuidanceHub/tree/main/guidance/agentevalguidancekit) to track your position in the lifecycle.

### When to use this skill vs. eval-result-interpreter

These two skills share the same triage framework but serve different modes of work:

| Use **eval-triage-and-improvement** when… | Use **eval-result-interpreter** when… |
|---|---|
| You want **interactive guidance** walking through diagnosis step by step | You have a CSV file or concrete results and want a **one-shot structured report** |
| You are in an **ongoing improvement loop**: fixing, re-running, and re-triaging | This is your **first look** at results and you need a verdict and top actions fast |
| You need **detailed remediation help** for specific quality signals (e.g., "wrong tool fires, now what?") | You want a **customer-deliverable artifact** (the .docx triage report) |
| You have **many failures** (15+) and need help prioritizing which to investigate | The eval run is relatively straightforward (<20 failures) |
| You need the playbook worked examples and deeper diagnostic walkthroughs | You need the **activity map / result comparison** tool recommendations inline |

**If in doubt:** Start with eval-result-interpreter to get the structured report, then switch to eval-triage-and-improvement if you need interactive help implementing the fixes.

## Workflow

### Step 1: Gather Eval Results

Ask the user to share:

1. **Which eval sets ran** and their pass rates (e.g., "Knowledge Grounding: 71%, Safety: 95%")
2. **Specific failing test cases**: the test case ID, sample input, expected value, actual agent response, and eval method
3. **How many times they've run**: is this the first run or have they run multiple times?
4. **What they've already tried**: any fixes attempted so far?

If they don't have structured results, help them organize what they have. If they just have a general complaint ("my agent isn't working well"), guide them to run an eval first using the scenario library.

### Step 2: Score Interpretation

Use these thresholds to assess readiness:

```
READINESS ASSESSMENT
Safety/Compliance < 95%  → BLOCK (fix before anything else)
Core business < 80%      → ITERATE (focus here)
Capabilities < threshold → CONDITIONAL SHIP (document gaps)
All above threshold      → SHIP
```

**Setting thresholds:** don't apply fixed numbers. Derive from risk profile:

| Factor | Higher Threshold When... |
|--------|------------------------|
| Consequence of failure | Financial loss, safety risk, legal exposure |
| Frequency of query type | Users trigger this quality signal often |
| Fallback availability | No human backup, or slow backup |
| Audience | External customers,