
Eval Suite Planner
Turn a plain-English description of your agent into a structured eval suite plan—scenarios, methods, signals, and thresholds—before generating tests or running evals.
Overview
Eval-suite-planner is an agent skill most often used in Build (also Ship) that produces a Microsoft-grounded eval suite plan—scenarios, methods, signals, and thresholds—before any test cases are generated.
Install
npx skills add https://github.com/microsoft/eval-guide --skill eval-suite-planner
What is this skill?
- Stage 1 (Define) of Microsoft’s 4-stage iterative agent evaluation framework
- Grounded in Eval Scenario Library: 5 business-problem scenario types with 29 sub-scenarios
- 9 capability scenario types with 49 sub-scenarios for coverage mapping
- Selects evaluation methods and quality signals with acceptance thresholds and priority order
- Explicit handoff: run `/eval-generator` next for Stage 2 (Set Baseline & Iterate)
- 4-stage iterative evaluation framework (Define, Set Baseline & Iterate, Systematic Expansion, Operationalize)
Adoption & trust: 28 installs on skills.sh; 12 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are building an AI agent but lack a structured eval plan that ties business scenarios, capability checks, and acceptance thresholds together.
Who is it for?
Solo builders defining measurable agent quality before investing in test generation, baselines, or CI eval gates.
Skip if: you are running eval jobs, scraping production logs, or jumping straight to codegen without first deciding which scenarios and signals matter.
When should I use this skill?
Use it when you have a plain-English agent description and need a structured eval suite plan before generating test cases or running evals.
What do I get? / Deliverables
You get a prioritized eval suite plan aligned to MS Learn Stage 1 so you can invoke eval-generator for baseline test cases and iterative runs.
- Structured eval suite plan with scenario types, methods, quality signals, thresholds, and priority order
- Coverage map aligned to Microsoft Eval Scenario Library and MS Learn Define stage
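One way to picture the deliverable is as a small data structure. The sketch below is illustrative only: the class names, field names, method and signal labels, and threshold values are assumptions made for this example, not the skill's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class PlannedScenario:
    """One row of a hypothetical eval suite plan (all field names are illustrative)."""
    scenario_type: str   # e.g. "Information Retrieval"
    method: str          # evaluation method chosen for this scenario (placeholder label)
    quality_signal: str  # what is being measured (placeholder label)
    threshold: float     # minimum acceptable pass rate, 0.0 to 1.0 (example value)
    priority: int        # 1 = highest priority


@dataclass
class EvalSuitePlan:
    agent_description: str
    scenarios: list[PlannedScenario] = field(default_factory=list)

    def by_priority(self) -> list[PlannedScenario]:
        """Return scenarios ordered from highest to lowest priority."""
        return sorted(self.scenarios, key=lambda s: s.priority)


# Example values are invented for illustration.
plan = EvalSuitePlan(
    agent_description="HR policy Q&A agent",
    scenarios=[
        PlannedScenario("Red-Teaming", "adversarial prompts", "refusal rate", 0.95, 2),
        PlannedScenario("Information Retrieval", "reference comparison", "answer accuracy", 0.90, 1),
    ],
)
ordered_types = [s.scenario_type for s in plan.by_priority()]
print(ordered_types)
```

Keeping the threshold and priority on each scenario row makes it straightforward to sort the plan and to check results against acceptance criteria later.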
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Build/agent-tooling is the canonical shelf because eval planning is defined while you are still shaping agent behavior and measurement—not after production-only firefighting. Agent-tooling subphase captures Stage 1 (Define) of agent evaluation: what to measure before baseline runs and CI hooks exist.
Where it fits
Draft scenario coverage and eval methods right after agent architecture is sketched.
Translate launch readiness questions into prioritized eval scenarios and pass thresholds.
Re-plan eval expansion when production behavior drifts from baseline capabilities.
How it compares
Planning artifact for agent evaluation—use before test generation instead of one-off “try these 3 prompts” smoke checks.
Common Questions / FAQ
Who is eval-suite-planner for?
Indie developers and small teams building agents who want Microsoft-aligned eval scenario coverage, methods, and thresholds before writing or running tests.
When should I use eval-suite-planner?
In build when designing a new agent’s measurement strategy, in ship/testing before release gates, or whenever you need Stage 1 (Define) complete prior to baseline evals or CI operationalization.
Is eval-suite-planner safe to install?
It plans evals from documentation and libraries rather than exfiltrating data, but review the Security Audits panel on this Prism page and avoid pasting production secrets into agent descriptions used for planning.
Workflow Chain
Then invoke: eval-generator
SKILL.md - Eval Suite Planner
## Purpose

This skill takes a plain-English description of an agent and produces a structured eval suite plan. It is the first step in the eval lifecycle: use it before generating test cases or running any evals. The output tells you exactly what scenarios to build, which evaluation methods to use, and how to know when you're done.

This skill covers **Stage 1 (Define)** of the MS Learn 4-stage evaluation framework. After planning, use `/eval-generator` for Stage 2 (Set Baseline & Iterate), then expand coverage (Stage 3) and operationalize into CI/CD (Stage 4).

**Knowledge sources:** This skill's guidance is grounded in three Microsoft sources:

- **Eval Scenario Library** (github.com/microsoft/ai-agent-eval-scenario-library): 5 business-problem scenario types with 29 sub-scenarios, 9 capability scenario types with 49 sub-scenarios, quality signals, and evaluation method selection
- **MS Learn agent evaluation documentation**: the 4-stage iterative evaluation framework (Define, Set Baseline & Iterate, Systematic Expansion, Operationalize), 7 test methods, acceptance criteria design, and evaluation categories
- **MS Learn evaluation checklist** ([guidance/evaluation-checklist](https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-checklist)): a 4-stage checklist template with a [downloadable editable version](https://github.com/microsoft/PowerPnPGuidanceHub/tree/main/guidance/agentevalguidancekit). The checklist defines Stage 3 expansion categories (Foundational core, Agent robustness, Architecture test, Edge cases) and introduces acceptance criteria design

## Instructions

When invoked as `/eval-suite-planner <agent description>`, read the description, infer the agent's primary task, key capabilities, and failure modes, then produce the following output in this exact order. Do not ask clarifying questions, do not pad responses, do not hedge.
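
Step 0 of these instructions routes an agent to scenario types based on what the agent does. As a rough illustration, that routing can be modeled as a lookup table. The trait keys (`answers_from_knowledge`, `sensitive_data`, and so on) and the function name `match_scenarios` are hypothetical labels invented for this sketch; only the scenario type names come from the routing table itself.

```python
# Primary routes: trait -> (business-problem type, capability types).
# Trait keys are hypothetical labels for the rows of the Step 0 routing table.
ROUTING = {
    "answers_from_knowledge": ("Information Retrieval", ["Knowledge Grounding", "Compliance"]),
    "executes_tasks": ("Request Submission", ["Tool Invocations", "Safety"]),
    "troubleshooting": ("Troubleshooting", ["Knowledge Grounding", "Graceful Failure"]),
    "process_guidance": ("Process Navigation", ["Trigger Routing", "Tone & Quality"]),
    "routes_conversations": ("Triage & Routing", ["Trigger Routing", "Graceful Failure"]),
}

# Add-on rows: they contribute capability types only.
ADD_ONS = {
    "sensitive_data": ["Safety", "Compliance"],
    "external_customers": ["Tone & Quality", "Safety"],
    "upcoming_changes": ["Regression"],
}

# The table includes Red-Teaming for every agent.
ALWAYS = ["Red-Teaming"]


def match_scenarios(traits):
    """Return (business_types, capability_types) for a list of trait keys."""
    business, capability = [], []

    def add_unique(target, items):
        # Preserve first-seen order while skipping duplicates.
        for item in items:
            if item not in target:
                target.append(item)

    for trait in traits:
        if trait in ROUTING:
            business_type, capability_types = ROUTING[trait]
            add_unique(business, [business_type])
            add_unique(capability, capability_types)
        elif trait in ADD_ONS:
            add_unique(capability, ADD_ONS[trait])
    add_unique(capability, ALWAYS)
    return business, capability


business, capability = match_scenarios(["answers_from_knowledge", "sensitive_data"])
print(business, capability)
```

In practice the skill infers these traits from the free-text agent description; the explicit trait list here only stands in for that inference step.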
---

### Step 0: Match the agent to scenario types

Use this routing table (from the Eval Scenario Library's Entry Path A) to identify which business-problem and capability scenario types apply to the described agent:

| If the agent... | Business-problem scenarios | Capability scenarios |
|---|---|---|
| Answers questions from knowledge sources | Information Retrieval (6 sub-scenarios) | Knowledge Grounding + Compliance |
| Executes tasks via APIs/connectors | Request Submission (6 sub-scenarios) | Tool Invocations + Safety |
| Walks users through troubleshooting | Troubleshooting (6 sub-scenarios) | Knowledge Grounding + Graceful Failure |
| Guides through multi-step processes | Process Navigation (6 sub-scenarios) | Trigger Routing + Tone & Quality |
| Routes conversations to teams/departments | Triage & Routing (5 sub-scenarios) | Trigger Routing + Graceful Failure |
| Handles sensitive data (PII, financial, health) | (add to whichever applies) | Safety + Compliance |
| Serves external customers | (add to whichever applies) | Tone & Quality + Safety |
| Is about to be updated or republished | (add to whichever applies) | Regression: re-run existing tests after changes |
| All agents (always include) | - | Red-Teaming: adversarial robustness testing |

Most agents match 1-2 business-problem types and 3-4 capability types. Select the ones that fit and name them explicitly.

**About the Regression row:** A regression set is not a separate scenario type; it is your existing suite of passing tests, re-run after any agent change to verify nothing broke. Include the regression row when the customer mentions upcoming changes (prompt edits, knowledge source updates, connector/plugin changes, republishing). When it applies:

- Flag that the customer's curre