
microsoft/eval-guide
5 skills · 161 installs · 60 stars · GitHub
Install
npx skills add https://github.com/microsoft/eval-guide

Skills in this repo
1. Eval FAQ
Eval FAQ is an agent skill that answers practical questions about how to evaluate AI agents: what scenarios to write, which graders to use, how to handle non-determinism, and how to read results. It is built for solo and indie builders shipping Copilot-style or custom agents who need methodology grounded in Microsoft’s agent evaluation ecosystem, not generic ML benchmark advice. Before answering, the skill fetches only the authoritative URLs that match your question topic, so responses stay tied to the Eval Scenario Library, Triage & Improvement Playbook, and Eval Guidance Kit rather than hallucinated best practices. Use it when you are designing eval suites, debugging flaky agent behavior, or deciding between capability and regression tests. The workflow is research-oriented: you ask one focused question, the skill pulls the right doc sections, then synthesizes actionable guidance. Complexity is intermediate to advanced, because eval design assumes you already have an agent and some failure modes in mind.
36 installs

2. Eval Result Interpreter
eval-result-interpreter is an agent skill for solo builders and small teams shipping Microsoft Copilot Studio agents who need more than a raw CSV. You feed it evaluation results (a file, a summary, or a plain-English description) and the skill applies Microsoft’s Triage & Improvement Playbook to produce a structured report: ship readiness, what broke, likely why, and what to fix first. It aligns with Stages 2 through 4 of the official evaluation checklist (baseline and iterate, expand coverage, operationalize and catch regressions). Use it when baseline numbers look ambiguous, when you are expanding test suites, or when an agent update suddenly fails cases you thought were stable. The output is decision-oriented, for indie operators who cannot afford a full QA org but still need a repeatable gate before customers see the agent.
35 installs

3. Eval Generator
eval-generator is a Microsoft eval-guide skill that turns evaluation plans into runnable test cases for Copilot Studio and related agent workflows. Solo builders and small teams use it after defining what “good” means, typically from eval-suite-planner output or a plain-English agent brief, so that testing does not rely on improvised chat prompts. It configures realistic inputs, expected outputs, and evaluation methods for both one-shot replies and multi-turn dialogs, then exports formats that humans can review and platforms can import. The skill sits in Stage 2 of Microsoft’s four-stage checklist, with pointers to systematic expansion and CI/CD operationalization in later stages. It pairs with the eval-result-interpreter and eval-triage-and-improvement skills, so you close the loop from baseline coverage to regression-aware shipping.
31 installs

4. Eval Triage And Improvement
eval-triage-and-improvement helps solo builders and small teams act on agent evaluation results instead of staring at aggregate scores. After Copilot Studio or compatible eval runs return failures, the skill walks through a hybrid workflow: collect results, cluster symptoms, hypothesize root causes, and document remediation with clear ownership. It targets the iterative Stages 2 through 4 of Microsoft’s evaluation checklist: baseline iteration, systematic expansion feedback, and operationalized CI/CD checks that catch regressions when agents change. Use it when pass rates drop, specific cases misfire, or you need “why did this fail” translated into prompt, tool, or topic fixes. It complements eval-result-interpreter by focusing on diagnosis and improvement plans rather than first-pass score narration.
31 installs

5. Eval Suite Planner
eval-suite-planner is a Microsoft eval-guide agent skill that produces a concrete eval suite plan from a plain-English description of your agent. It is the first step in the eval lifecycle: you use it before generating test cases or executing any eval runs. The output specifies which scenarios to build, which evaluation methods apply, which quality signals matter, what thresholds define success, and in what order to tackle coverage. Guidance is anchored in Microsoft’s Eval Scenario Library on GitHub, MS Learn’s four-stage evaluation framework (Define, Set Baseline & Iterate, Systematic Expansion, Operationalize), and the evaluation checklist for acceptance criteria and test methods. Solo builders shipping Claude Code or Cursor agents get a structured plan instead of ad-hoc prompt spot-checks. After the plan is approved, invoke eval-generator for baseline test creation and iteration. Later stages expand coverage and operationalize evals into CI/CD; this skill stops at a complete Define-stage blueprint.
28 installs