
Skill Creator
Run structured blind A/B skill evaluations and post-hoc winner/loser analysis so you can iteratively improve agent SKILL.md packages with evidence.
Overview
Skill Creator is an agent skill, most often used in the Build phase (and also in Ship), that analyzes blind skill-comparison results to explain why the winner won and to generate concrete improvements for the losing skill.
Install
npx skills add https://github.com/cognitedata/builder-skills --skill skill-creator
What is this skill?
- Post-hoc analyzer unblinds comparator results to explain why output A beat B
- Structured inputs: winner/loser paths, transcripts, and comparator JSON for repeatable audits
- Compares SKILL.md structure—clarity, scripts, examples, and edge-case coverage
- Produces improvement suggestions actionable for the losing skill author
- Fits eval/benchmark loops for skill-creator style quality gates
Adoption & trust: 1.4k installs on skills.sh; 4 GitHub stars; 2/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You ran two skill variants and know which output won, but not which instructions or gaps caused the loss.
Who is it for?
Teams or solo authors maintaining a skills repo who already run blind comparisons and want systematic post-hoc reviews after each tournament.
Skip if: you are doing greenfield feature coding with no SKILL.md eval artifacts, or you only need a single ad-hoc prompt with no comparison pipeline.
When should I use this skill?
A blind comparator has chosen winner A or B and you have paths to both skills, transcripts, and comparison JSON for post-hoc analysis.
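As a rough illustration, the preconditions above could be checked before the analyzer is invoked. This is a minimal Python sketch, not part of the skill itself: the `AnalyzerInputs` class and the `missing_inputs` helper are hypothetical names, and only the field names mirror the input parameters listed in the skill's SKILL.md.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class AnalyzerInputs:
    """Hypothetical container; field names follow the SKILL.md input list."""
    winner: str                  # "A" or "B", from the blind comparison
    winner_skill_path: str
    winner_transcript_path: str
    loser_skill_path: str
    loser_transcript_path: str
    comparison_result_path: str
    output_path: str             # written by the analyzer, so it need not exist yet


def missing_inputs(inputs: AnalyzerInputs) -> list[str]:
    """Return human-readable problems; an empty list means the run can start."""
    problems = []
    if inputs.winner not in ("A", "B"):
        problems.append(f"winner must be 'A' or 'B', got {inputs.winner!r}")
    for f in fields(inputs):
        # Assumption: every input except `winner` and `output_path` is a path
        # that must already exist on disk.
        if f.name in ("winner", "output_path"):
            continue
        value = getattr(inputs, f.name)
        if not Path(value).exists():
            problems.append(f"{f.name} does not exist: {value}")
    return problems
```

Running `missing_inputs` first makes a missing transcript or comparator file show up as an explicit message rather than as a vague analysis later on.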
What do I get? / Deliverables
You get an unblinded analysis JSON with reasons tied to SKILL.md and transcripts, plus prioritized edits for the losing skill before the next eval run.
- Post-hoc analysis JSON saved to the specified output_path
- Actionable improvement list for the losing skill's instructions and examples
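To make the deliverable concrete, here is a hedged Python sketch of how the saved analysis file might be read back and turned into a short report. The function names are hypothetical, and the sketch relies only on the keys that appear in the SKILL.md output example (`comparison_summary`, `winner_strengths`, `loser_weaknesses`); the real file may contain further fields that are ignored here.

```python
import json


def summarize_analysis(analysis: dict) -> str:
    """Render a short plain-text report from a post-hoc analysis dict."""
    summary = analysis.get("comparison_summary", {})
    lines = [
        f"Winner: {summary.get('winner', '?')} ({summary.get('winner_skill', '?')})",
        f"Loser skill: {summary.get('loser_skill', '?')}",
        "Winner strengths:",
    ]
    lines += [f"  + {s}" for s in analysis.get("winner_strengths", [])]
    lines.append("Loser weaknesses:")
    lines += [f"  - {w}" for w in analysis.get("loser_weaknesses", [])]
    return "\n".join(lines)


def load_and_summarize(output_path: str) -> str:
    """Read the analysis JSON written to output_path and summarize it."""
    with open(output_path, encoding="utf-8") as fh:
        return summarize_analysis(json.load(fh))
```

Missing keys fall back to placeholders instead of raising, so a partially written analysis file still produces a readable report.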
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Skill authoring and evaluation are core agent-tooling work in Build—the first place teams formalize how agents behave. Agent-tooling is the canonical shelf for meta workflows that create, compare, and refine skills rather than application features.
Where it fits
After two planning skills compete blind, unblind results to merge the clearer checklist from the winner into the loser.
Turn comparator reasoning into changelog bullets for SKILL.md and referenced helper files.
Gate a skill version bump by documenting why the new SKILL.md beat the previous tag in a structured analysis file.
When users report bad agent behavior, compare last-known-good vs current skill runs using stored transcripts.
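The version-bump gate mentioned above could be approximated with a small check over the analysis file. This is a hedged sketch: `bump_is_justified` is a hypothetical helper, and the policy it encodes (the new skill must be the recorded winner, with a non-empty reasoning summary and at least one listed strength) is an assumption rather than anything the skill prescribes.

```python
def bump_is_justified(analysis: dict, new_skill_path: str) -> bool:
    """Return True when the analysis documents a win for the new skill version.

    Assumed policy (not defined by the skill): the new skill is the recorded
    winner, and the analysis explains the win with comparator reasoning and
    at least one winner strength.
    """
    summary = analysis.get("comparison_summary", {})
    new_skill_won = summary.get("winner_skill") == new_skill_path
    documented = bool(summary.get("comparator_reasoning")) and bool(
        analysis.get("winner_strengths")
    )
    return new_skill_won and documented
```

A CI step could call this on the stored analysis and refuse the bump when it returns `False`.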
How it compares
Evidence-based skill QA workflow—not a general code generator or an MCP integration server.
Common Questions / FAQ
Who is skill-creator for?
Agent skill authors and platform builders who iterate on SKILL.md quality using blind tests, transcripts, and comparator outputs—not app feature developers by default.
When should I use skill-creator?
After a blind comparison in Build while revising skills; during Ship review when checking regressions between skill versions; during Operate-style iteration when production agent behavior traces back to weak skill instructions.
Is skill-creator safe to install?
It expects local paths to skills and transcripts—review what folders you point the agent at. Confirm pass/fail details on the Security Audits panel on this Prism page before enabling automated eval scripts.
SKILL.md
# Post-hoc Analyzer Agent

Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.

## Role

After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?

## Inputs

You receive these parameters in your prompt:

- **winner**: "A" or "B" (from blind comparison)
- **winner_skill_path**: Path to the skill that produced the winning output
- **winner_transcript_path**: Path to the execution transcript for the winner
- **loser_skill_path**: Path to the skill that produced the losing output
- **loser_transcript_path**: Path to the execution transcript for the loser
- **comparison_result_path**: Path to the blind comparator's output JSON
- **output_path**: Where to save the analysis results

## Process

### Step 1: Read Comparison Result

1. Read the blind comparator's output at comparison_result_path
2. Note the winning side (A or B), the reasoning, and any scores
3. Understand what the comparator valued in the winning output

### Step 2: Read Both Skills

1. Read the winner skill's SKILL.md and key referenced files
2. Read the loser skill's SKILL.md and key referenced files
3. Identify structural differences:
   - Instruction clarity and specificity
   - Script/tool usage patterns
   - Example coverage
   - Edge case handling

### Step 3: Read Both Transcripts

1. Read the winner's transcript
2. Read the loser's transcript
3. Compare execution patterns:
   - How closely did each follow its skill's instructions?
   - What tools were used differently?
   - Where did the loser diverge from optimal behavior?
   - Did either encounter errors or make recovery attempts?

### Step 4: Analyze Instruction Following

For each transcript, evaluate:

- Did the agent follow the skill's explicit instructions?
- Did the agent use the skill's provided tools/scripts?
- Were there missed opportunities to leverage skill content?
- Did the agent add unnecessary steps not in the skill?

Score instruction following 1-10 and note specific issues.

### Step 5: Identify Winner Strengths

Determine what made the winner better:

- Clearer instructions that led to better behavior?
- Better scripts/tools that produced better output?
- More comprehensive examples that guided edge cases?
- Better error handling guidance?

Be specific. Quote from skills/transcripts where relevant.

### Step 6: Identify Loser Weaknesses

Determine what held the loser back:

- Ambiguous instructions that led to suboptimal choices?
- Missing tools/scripts that forced workarounds?
- Gaps in edge case coverage?
- Poor error handling that caused failures?

### Step 7: Generate Improvement Suggestions

Based on the analysis, produce actionable suggestions for improving the loser skill:

- Specific instruction changes to make
- Tools/scripts to add or modify
- Examples to include
- Edge cases to address

Prioritize by impact. Focus on changes that would have changed the outcome.

### Step 8: Write Analysis Results

Save structured analysis to `{output_path}`.

## Output Format

Write a JSON file with this structure:

```json
{
  "comparison_summary": {
    "winner": "A",
    "winner_skill": "path/to/winner/skill",
    "loser_skill": "path/to/loser/skill",
    "comparator_reasoning": "Brief summary of why comparator chose winner"
  },
  "winner_strengths": [
    "Clear step-by-step instructions for handling multi-page documents",
    "Included validation script that caught formatting errors",
    "Explicit guidance on fallback behavior when OCR fails"
  ],
  "loser_weaknesses": [
    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
    "No script for validation, agent had to improvise and made errors",
    "No guidance on OCR failure, agent gave up instead of trying alternatives"
  ],
  "instruction_following": {
    "winner": {