
Proofrag
Install when you ship a RAG or LLM app and need golden sets from your docs, LLM-as-judge scoring, retrieval metrics, shareable scorecards, and CI gates.
Overview
proofrag is a plugin marketplace for the Ship phase that evaluates RAG/LLM apps with golden sets, LLM-as-judge, retrieval metrics, scorecards, and CI gates.
What is this marketplace?
- Golden test sets generated from your own documentation corpus
- LLM-as-judge evaluation plus retrieval metrics in one skill bundle
- Shareable scorecard output for stakeholders and iteration reviews
- CI gate support so regressions fail builds
- Keywords: rag, llm-as-judge, evaluation, retrieval metrics
- 1 plugin: proofrag version 0.5.2, MIT license
- Marketplace metadata describes golden sets, LLM-as-judge, and scorecards
Community signal: 1 GitHub star.
What problem does it solve?
RAG demos look fine in chat but nobody measures grounded accuracy, retrieval drift, or regressions before merge.
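To make "measuring retrieval" concrete, here is a minimal sketch of two common retrieval metrics, recall@k and reciprocal rank, computed for one golden question. It is not Proofrag's implementation; the function names and chunk IDs are hypothetical and only illustrate the kind of number such an evaluation produces.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs found in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical golden question: the retriever's ranked output and the
# chunk IDs a golden set would mark as containing the answer.
retrieved = ["doc-7", "doc-2", "doc-9", "doc-4"]
relevant = {"doc-2", "doc-4"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only doc-2 is in the top 3)
print(reciprocal_rank(retrieved, relevant))   # 0.5 (first hit is at rank 2)
```

Averaging these values over every question in a golden set gives a single figure that can be tracked from one change to the next.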
Who is it for?
Indie builders shipping doc-QA, support bots, or internal copilots who need repeatable evals, not one-off manual spot checks.
Skip if: you run a static site with no LLM or retrieval layer, or your team is unwilling to maintain golden questions tied to its docs.
What do I get? / Deliverables
After install, you get doc-derived golden sets, judged scores, retrieval metrics, a shareable scorecard, and optional CI failure on quality drops.
- Golden evaluation set derived from your docs
- LLM-as-judge and retrieval metric results
- Shareable scorecard and optional CI gate signal
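The CI gate idea can be sketched as a threshold check over scorecard values. This is an assumption-laden illustration, not Proofrag's actual gate: the metric names (`judge_score`, `recall_at_5`), the thresholds, and the scorecard shape are all hypothetical.

```python
from typing import Dict, List

# Hypothetical minimum acceptable values; a real project would choose its own.
THRESHOLDS: Dict[str, float] = {"judge_score": 0.80, "recall_at_5": 0.70}


def failed_metrics(scorecard: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Describe every metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = scorecard.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: {value} (minimum {minimum})")
    return failures


def gate(scorecard: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Return a process exit code: 0 when all metrics pass, 1 otherwise."""
    failures = failed_metrics(scorecard, thresholds)
    for line in failures:
        print("FAIL", line)
    return 1 if failures else 0


# Hypothetical scorecard values for one evaluation run.
scorecard = {"judge_score": 0.84, "recall_at_5": 0.66}
exit_code = gate(scorecard, THRESHOLDS)
print("exit code:", exit_code)  # a CI wrapper would pass this to sys.exit()
```

Returning the exit code instead of exiting directly keeps the check easy to unit test; the CI step is then a thin wrapper that exits with that value.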
Plugins in this marketplace
1 plugin: install it individually after you add the marketplace.
Journey fit
RAG quality proof belongs on the ship/testing shelf because you run it in CI and before trusting production answers, not while brainstorming positioning. Golden-set evaluation, judge models, and retrieval metrics are test-harness concerns for LLM apps, which matches the ship → testing stage.
How it compares
Proofrag is a RAG/LLM evaluation skill marketplace, not a vector DB or embedding provider integration.
Common Questions / FAQ
Who is Proofrag for?
It is for builders running RAG or LLM apps who want golden-set evaluation, LLM-as-judge, and CI gates inside Claude Code.
When should I use Proofrag?
Use it before release and after chunking, embedder, or model changes to catch retrieval and answer-quality regressions.
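As a rough sketch of catching a regression after such a change, the snippet below compares a current run against a stored baseline and reports metrics that dropped by more than a tolerance. The metric names, values, and tolerance are hypothetical, and this is not a description of how Proofrag itself compares runs.

```python
from typing import Dict


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Map each metric that fell by more than `tolerance` to the size of its drop.

    Metrics absent from `current` are skipped rather than treated as drops.
    """
    drops = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and old - new > tolerance:
            drops[name] = round(old - new, 4)
    return drops


# Hypothetical scores before and after swapping the embedding model.
baseline = {"recall_at_5": 0.78, "judge_score": 0.85}
current = {"recall_at_5": 0.70, "judge_score": 0.84}

print(regressions(baseline, current))  # {'recall_at_5': 0.08}
```

The tolerance absorbs run-to-run noise, which matters when an LLM judge produces slightly different scores on identical inputs.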
How do I add Proofrag to my agent?
Add the unshDee/Proofrag Claude marketplace, enable the Proofrag plugin (MIT v0.5.2), and point it at your app and documentation for golden-set generation.