
microsoft/eval-guide

5 skills · 171 installs · 595 stars · GitHub

Install

npx skills add https://github.com/microsoft/eval-guide

Skills in this repo

1. Eval Faq · 39 installs

The eval-faq skill answers methodology questions about AI agent evaluation with practical, opinionated guidance anchored in Microsoft's agent evaluation ecosystem. It follows a fetch-first workflow that routes question topics to authoritative URLs such as the Eval Scenario Library, Triage and Improvement Playbook, MS Learn Copilot Studio pages, and select industry supplements before synthesizing a concise answer. The canonical spine is the Practical Guidance on Agent Evaluation 10-step playbook covering planning, capability and trust-safety eval sets, pass-rate gates, baseline runs, failure diagnosis, regression suites, optimization loops, and reusable asset promotion. Answer style rules require three to five direct sentences, concrete numbers like twenty to fifty cases or three trials per case, and at most one clarifying question when architecture or risk tier materially changes guidance. Knowledge base sections document scenario types, grader selection, non-determinism handling, rubric refinement, and SHIP versus ITERATE versus BLOCK readiness framing. Triggers include /eval-faq questions about graders, datasets, tool-call evaluation, or interpreting eval results. Use whenever t.

2. Eval Result Interpreter · 38 installs

The eval-result-interpreter skill analyzes Copilot Studio evaluation results into a structured triage report grounded in the 10-step playbook Steps 6, 7, and 9 plus Microsoft's Triage and Improvement Playbook. It accepts Copilot Studio CSV exports with question, expectedResponse, actualResponse, test methods, pass or fail results, and grader explanations, plus plain-text summaries or manifest metadata from companion docx or stage JSON files. Output begins with infrastructure health pre-check, then baseline score summary with per-set capability and trust-safety tables preferring manifest gate types and targets. Verdicts follow gate-based SHIP, ITERATE, or BLOCK rules where any missed hard gate blocks ship and aggregate pass rate alone cannot override a failed hard gate. Failure analysis classifies root causes into eval-setup versus agent-quality buckets and maps remediation actions. Triggers include /eval-result-interpreter on CSV paths or pasted eval summaries.

3. Eval Triage And Improvement · 35 installs

The eval-triage-and-improvement skill helps interpret Copilot Studio agent evaluation results and drive fixes using the Practical Guidance 10-step playbook, especially Step 7 iterate to diagnose failures. The workflow gathers eval set pass rates, manifest metadata, failing test cases, rerun history, and prior fix attempts, then assesses readiness against hard gates and soft targets by risk tier. An infrastructure pre-check verifies knowledge indexing, connector health, auth tokens, and published agent version before triaging individual failures. Prioritization orders hard-gate trust failures, high-risk capability misses, lowest-scoring sets, and recurring regressions, sampling three to five cases when fifteen or more fail. Each failure lands in exactly one Step 7 root bucket: eval-setup problem or agent-quality problem with operational subtypes. Conversation evals identify the critical turn first and mark downstream cascade failures. Remediation maps to eval manifest fixes, agent configuration changes, or platform limitation logging with owners and verification eval sets. Use when eval scores need interactive diagnosis and improvement after baseline or rerun results.

4. Eval Generator · 31 installs

The eval-generator skill produces Generate artifacts for the eval-guide lifecycle: importable Copilot Studio evaluation CSV files with Question and Expected response columns, a customer-ready docx manifest, and eval-setup-guide.docx for testing methods. Primary mode reads the eval-suite planner workbook registry for capability and trust and safety sets. Fallback mode accepts a plain-English agent description with at least six to eight cases including adversarial scenarios. It delivers playbook Steps 2 and 3 eval sets and designs the Step 8 regression partition. Defaults to Single Response evaluation mode for most agents. Use after eval-suite-planner and before running evaluations.

5. Eval Suite Planner · 28 installs

The eval-suite-planner skill produces the Plan artifact for eval-guide: a populated eval-suite workbook with Planning sheet for agent identity, risk tier, owners, gates, and lifecycle stage plus Eval Suite Registry rows for capability dimensions and trust and safety categories with targets, cadence, and provenance. Feeds eval-generator for CSV and docx outputs and later Run and Interpret stages. Use before generating test cases when building systematic Copilot Studio or agent evaluation programs.
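
The gate-based verdict rule described for eval-result-interpreter can be illustrated with a small sketch. This is not the skill's actual implementation: the `EvalSet` type, its field names, and the `verdict` function are hypothetical, and the mapping of "all hard gates met but a soft target missed" to ITERATE is an assumption, since the descriptions above only state that a missed hard gate blocks ship regardless of aggregate pass rate.

```python
from dataclasses import dataclass


@dataclass
class EvalSet:
    """Hypothetical record for one eval set; field names are illustrative."""
    name: str
    pass_rate: float  # fraction of cases passed, 0.0 to 1.0
    target: float     # pass rate this set is expected to reach
    hard_gate: bool   # True for a hard gate, False for a soft target


def verdict(sets):
    """Return SHIP, ITERATE, or BLOCK for a list of eval sets."""
    missed = [s for s in sets if s.pass_rate < s.target]
    # Any missed hard gate blocks ship, no matter how the other sets scored.
    if any(s.hard_gate for s in missed):
        return "BLOCK"
    # Assumption: missing only soft targets means another iteration.
    if missed:
        return "ITERATE"
    return "SHIP"


results = [
    EvalSet("trust-safety", pass_rate=0.90, target=1.00, hard_gate=True),
    EvalSet("capability", pass_rate=0.99, target=0.80, hard_gate=False),
]
print(verdict(results))
```

In this example the aggregate pass rate is high, yet the verdict is BLOCK because the hard-gated set misses its target, which is the behavior the description calls out.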
