
Eval Faq
Get opinionated answers on AI agent eval methodology—graders, datasets, multi-turn tests, and Microsoft’s eval playbooks—without digging through scattered docs.
Install
npx skills add https://github.com/microsoft/eval-guide --skill eval-faq
What is this skill?
- Topic-routed fetching from MS Eval Scenario Library, Triage Playbook, and Eval Guidance Kit before answering
- Covers scenario types (business-problem vs capability), dataset design, and criteria writing for agents
- Guidance on grader types, tool-call evaluation, multi-turn flows, and capability vs regression evals
- Supplements Microsoft sources with select industry references where MS docs are thin
- Invoked as /eval-faq with a free-form methodology question
Adoption & trust: 36 installs on skills.sh; 12 GitHub stars; 2/3 security scanners passed (skills.sh audits).
Journey fit
Agent evaluation is the canonical shelf in the Ship phase because regression and capability evals gate whether an agent is safe to release; invoke the skill when you need testing discipline, not when you are first sketching an idea. The Testing subphase matches eval FAQ topics such as criteria writing, non-determinism, tool-call grading, and interpreting pass rates before launch.
Common Questions / FAQ
Is Eval Faq safe to install?
skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md - Eval Faq
## Purpose

Answer any question about eval methodology, grader types, dataset design, criteria writing, non-determinism, tool-call evaluation, multi-turn agent evaluation, eval tooling, capability vs. regression evals, and interpreting results, specifically in the context of AI agent evaluation.

Guidance is grounded primarily in **Microsoft's agent evaluation documentation** (MS Learn agent evaluation pages, the Eval Scenario Library, the Triage & Improvement Playbook, and the Eval Guidance Kit), supplemented by select industry sources for topics Microsoft does not cover deeply.

## Instructions

When invoked as `/eval-faq <question>`, follow this process exactly:

### Step 1: Fetch authoritative context before answering

Use this topic-to-URL routing table to decide what to fetch. Fetch FIRST, then answer. Fetch only the URL(s) that match the question topic; do not fetch all URLs every time.

| Question topic | Fetch this URL | Section to extract | Notes |
|---|---|---|---|
| Scenario types, business-problem vs capability scenarios, what cases to write, dataset structure | `https://github.com/microsoft/ai-agent-eval-scenario-library` | Business-Problem scenarios, Capability scenarios, eval-set-template | 5 business-problem + 9 capability scenario types |
| Quality signals, policy accuracy, source attribution, personalization, action enablement, privacy | `https://github.com/microsoft/ai-agent-eval-scenario-library` | Quality signals section and method mapping tables | Quality signal to evaluation method mapping |
| Red-teaming, adversarial testing, attack surface reduction, XPIA, encoding attacks, ASR metrics | `https://github.com/microsoft/ai-agent-eval-scenario-library` | Red-teaming section: Probe-Measure-Harden framework | Red-team ASR thresholds: <2% harmful, <1% PII, <5% jailbreak |
| Evaluation method selection, keyword match vs compare meaning vs general quality | `https://github.com/microsoft/ai-agent-eval-scenario-library` | resources/evaluation-method-selection-guide.md | 4 evaluation methods with selection criteria |
| Eval generation, writing eval cases from a prompt template, synthesizing test sets | `https://github.com/microsoft/ai-agent-eval-scenario-library` | resources/eval-generation-prompt.md | Template for generating eval cases |
| Agent profile template, defining agent scope for eval | `https://github.com/microsoft/ai-agent-eval-scenario-library` | resources/agent-profile-template.yaml | Agent profile definition for scoping evals |
| Score interpretation, what scores mean, risk-based thresholds, readiness decisions, SHIP/ITERATE/BLOCK | `https://github.com/microsoft/triage-and-improvement-playbook` | Layer 1: Score Interpretation, readiness decision tree | SHIP / ITERATE / BLOCK decision framework |
| Failure triage, debugging eval failures, root cause analysis, diagnostic questions | `https://github.com/microsoft/triage-and-improvement-playbook` | Layer 2: Failure Triage, 26 diagnostic questions | 5-question eval verification, 7 eval setup failure sub-types |
| Remediation, fixing failures, instruction budget, actions per quality signal | `https://github.com/microsoft/triage-and-improvement-playbook` | Layer 3: Remediation Mapping | Actions mapped to quality signals |
| Pattern analysis, cross-signal patterns, trend analysis, concentration analysis | `https://github.com/microsoft/triage-and-improvement-playbook` | Layer 4: Pattern Analysis | 7 cross-signal patterns, trend analysis |
| Root cause types, eval setup issue vs agent config vs platform limitation | `https://github.com/microsoft/triage-and-improvement-playbook` | Root Cause Types section | 3 root cause categories with diagnostic flow |
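
The routing rule above (match the question topic, then fetch only the matching URL or URLs) can be illustrated with a small lookup. This is a minimal sketch, not the skill's actual implementation: the names `ROUTES`, `route_question`, and `urls_to_fetch` are hypothetical, the keyword lists are assumed stand-ins for the "Question topic" column, and only five rows of the table are included. The two URLs and the section names are copied from the table.

```python
# Hypothetical sketch of topic-to-URL routing; not part of the skill itself.
SCENARIO_LIBRARY = "https://github.com/microsoft/ai-agent-eval-scenario-library"
TRIAGE_PLAYBOOK = "https://github.com/microsoft/triage-and-improvement-playbook"

# Each row: (assumed keywords, URL to fetch, section to extract).
# Keywords are illustrative guesses; URLs and sections come from the table.
ROUTES = [
    (("red-team", "adversarial", "xpia", "jailbreak"),
     SCENARIO_LIBRARY, "Red-teaming section: Probe-Measure-Harden framework"),
    (("scenario", "dataset structure"),
     SCENARIO_LIBRARY, "Business-Problem scenarios, Capability scenarios, eval-set-template"),
    (("score", "threshold", "readiness"),
     TRIAGE_PLAYBOOK, "Layer 1: Score Interpretation, readiness decision tree"),
    (("triage", "root cause", "diagnostic"),
     TRIAGE_PLAYBOOK, "Layer 2: Failure Triage, 26 diagnostic questions"),
    (("remediation", "instruction budget"),
     TRIAGE_PLAYBOOK, "Layer 3: Remediation Mapping"),
]


def route_question(question):
    """Return (url, section) pairs for every row whose keywords appear in the question."""
    text = question.lower()
    return [
        (url, section)
        for keywords, url, section in ROUTES
        if any(keyword in text for keyword in keywords)
    ]


def urls_to_fetch(question):
    """Return the distinct URLs to fetch, in table order, so no URL is fetched twice."""
    urls = []
    for url, _section in route_question(question):
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    for url, section in route_question("How do I interpret score thresholds before release?"):
        print(url, "->", section)
```

A question that matches no row returns an empty list, so nothing is fetched for it. Plain substring matching is a deliberate simplification here; an agent following the skill would judge topic relevance from the question as a whole.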