
Eval Generator
Turn an eval suite plan or agent description into importable Copilot Studio test cases, CSV rows, and a reviewable docx report.
Install
npx skills add https://github.com/microsoft/eval-guide --skill eval-generator
What is this skill?
- Second step in the plan → generate → run → interpret eval lifecycle
- Supports single-response and multi-turn conversation evaluation modes
- Produces a Copilot Studio test set table, a CSV import file (single-response mode only), and a docx for human review
- Anchored to Stage 2 (Set Baseline & Iterate) of the MS Learn 4-stage evaluation framework
- Expansion categories: Foundational core, Agent robustness, Architecture test, Edge cases
Adoption & trust: 31 installs on skills.sh; 12 GitHub stars; 3/3 security scanners passed (skills.sh audits).
Journey fit
Primary fit
Baseline agent quality gates belong in the Ship stage, where you define measurable tests before release, even though evals also support later iteration. The skill outputs concrete test cases and evaluation method configs, which are the core artifacts of structured QA for conversational agents.
Common Questions / FAQ
Is Eval Generator safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md - Eval Generator
## Purpose

This skill generates concrete eval test cases with realistic inputs, expected outputs, and evaluation method configurations. It is the second step in the eval lifecycle: plan → **generate** → run → interpret.

This skill covers **Stage 2 (Set Baseline & Iterate)** of the MS Learn [4-stage evaluation framework](https://learn.microsoft.com/en-us/microsoft-copilot-studio/guidance/evaluation-checklist). Use `/eval-suite-planner` first for Stage 1 (Define), then generate test cases here, run them, and interpret results with `/eval-result-interpreter`. Stage 3 (Systematic Expansion) means repeating this cycle with broader coverage; the checklist defines four expansion categories: Foundational core, Agent robustness, Architecture test, and Edge cases. Stage 4 (Operationalize) means embedding these evals into your agent's CI/CD pipeline. Point customers to the [editable checklist template](https://github.com/microsoft/PowerPnPGuidanceHub/tree/main/guidance/agentevalguidancekit) to track their progress across all four stages.

**Primary mode**: If the conversation already contains output from `/eval-suite-planner`, use that plan’s scenario table, evaluation methods, quality signals, and tags as the blueprint. Generate one test case per row in the plan.

**Fallback mode**: If no plan exists in the conversation, accept a plain-English agent description and generate test cases from scratch (6-8 cases minimum).

## Instructions

When invoked as `/eval-generator` (with or without additional input):

### Step 1: Detect input mode

Check the conversation history for output from `/eval-suite-planner`. Look for the scenario plan table (a markdown table with columns: #, Scenario Name, Category, Tag, Evaluation Methods).

- **Plan found**: Use it as the blueprint. Say: "Generating test cases from your eval suite plan (X scenarios)." Generate one test case per row.
- **No plan, but user provides an agent description**: Generate from scratch. Say: "Generating eval scenarios for: [agent task in your own words]." If the description is fewer than two sentences or doesn’t mention success criteria, ask exactly one clarifying question, then wait.
- **No plan and no description**: Say: "I need either an agent description or a plan from `/eval-suite-planner`. Run `/eval-suite-planner <your agent description>` first for the best results, or give me a description and I’ll generate directly."

### Step 1b: Determine evaluation mode (Single Response vs. Conversation)

Before generating test cases, determine which evaluation mode fits the agent. This affects the output format, available test methods, and import options.

**Choose Conversation mode when the agent:**

- Handles multi-step tasks that require context across turns (e.g., booking a trip with departure, return, and seat selection)
- Needs to ask clarifying questions before completing a request
- Must maintain state (e.g., remembering a customer’s account after initial identification)
- Has handoff or escalation flows that depend on prior turns

**Choose Single Response mode when the agent:**

- Answers standalone questions (FAQ, policy lookup, factual retrieval)
- Routes to a single tool per request
- Produces a self-contained output per input (e.g., a summary, a classification)

**Default:** If the plan or agent description does not indicate multi-turn behavior, default to Single Response.

**If Conversation mode is selected**, say: "This agent benefits from conversational (multi-turn) evaluation. I will generate conversation test cases; each is a multi-turn dialogue, not a single question." Then skip t