
ARA Rigor Reviewer
Score and critique an Academic Research Artifact (ARA) for epistemic rigor before you treat its claims as shippable evidence.
Overview
ARA Rigor Reviewer is an agent skill most often used in Ship (also Validate, Build) that scores research artifacts on six epistemic dimensions and flags weak claim–evidence links after structural validation.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill ara-rigor-reviewer
What is this skill?
- Six Level 2 dimensions with 1–5 scoring anchors (e.g. D1 Evidence Relevance, D2 Falsifiability Quality)
- Semantic checks: relevance, type-aware entailment, and evidence sufficiency per claim
- Severity-tagged findings (critical, major, minor, or suggestion) aligned to each check in the inventory
- Explicit separation from Level 1 structural validation (YAML, references, field presence)
- Claim-type-aware expectations (causal, generalization, improvement, descriptive, scoping)
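The claim-type-aware expectations can be pictured as a small lookup. The following is a minimal Python sketch, assuming the claim-type-to-design pairing listed in the skill's D1 check inventory (causal→ablation, generalization→heterogeneous, improvement→baseline, descriptive→sampling, scoping→bounds). The names `EXPECTED_DESIGN` and `entailment_finding` are hypothetical and not part of the skill, which performs this check by reading and reasoning rather than by string matching.

```python
from typing import Optional

# Claim type -> experiment design expected by the D1 "Type-aware entailment" check.
# The pairs come from the check inventory; the dict itself is illustrative.
EXPECTED_DESIGN = {
    "causal": "ablation",
    "generalization": "heterogeneous",
    "improvement": "baseline",
    "descriptive": "sampling",
    "scoping": "bounds",
}


def entailment_finding(claim_type: str, experiment_design: str) -> Optional[dict]:
    """Return a major finding when the design does not match the claim type, else None."""
    expected = EXPECTED_DESIGN.get(claim_type)
    if expected is None:
        raise ValueError(f"unknown claim type: {claim_type!r}")
    if experiment_design == expected:
        return None
    return {
        "check": "Type-aware entailment",
        "severity": "major",
        "message": f"{claim_type} claim expects {expected} evidence, got {experiment_design}",
    }
```

For instance, a causal claim backed only by a baseline comparison would produce a major finding under this sketch, while the same claim backed by an ablation would produce none.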
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a polished research artifact with citations, but you cannot tell whether each experiment actually supports its claims or meets a falsifiability bar.
Who is it for?
Solo builders maintaining ARA-style research docs with explicit claims, experiments, and references who want a structured rigor pass.
Skip if: you only need YAML or reference parsing (that is Level 1, not this semantic reviewer), or you have no claims or experiments to score.
When should I use this skill?
An ARA has passed structural checks and you need Level 2 semantic scoring over claim–evidence relevance and falsifiability.
What do I get? / Deliverables
You get dimension scores, major and suggestion findings on relevance and entailment, and a clearer list of claim fixes before you ship or cite the work.
- Per-dimension scores (1–5) with anchored rationale
- Major and suggestion findings tied to the check inventory
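One way to picture these deliverables is as a small report object. This is a hedged sketch only: the skill's actual output format is not shown in this excerpt, and `Finding`, `RigorReport`, and `blocking` are hypothetical names chosen for illustration. Treating major findings as blocking is likewise an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    dimension: str  # e.g. "D1"
    check: str      # e.g. "Relevance"
    severity: str   # e.g. "major" or "suggestion"
    note: str       # reviewer's rationale


@dataclass
class RigorReport:
    scores: Dict[str, int] = field(default_factory=dict)  # dimension id -> score 1..5
    findings: List[Finding] = field(default_factory=list)

    def add_score(self, dimension: str, score: int) -> None:
        # Scores follow the 1-5 anchors; anything else is rejected.
        if not 1 <= score <= 5:
            raise ValueError("scores must be between 1 and 5")
        self.scores[dimension] = score

    def blocking(self) -> List[Finding]:
        """Major findings, which this sketch assumes must be cleared before shipping."""
        return [f for f in self.findings if f.severity == "major"]
```

A caller could record a score with `report.add_score("D1", 3)`, append findings as they are raised, and then inspect `report.blocking()` to see which major items remain open.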
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Ship → Review because the skill is a semantic quality gate (Level 2) after structural checks, parallel to code review but for research claims. Subphase review fits a dedicated rigor pass over claim–evidence links, falsifiability, and dimension scores—not initial ideation or infra ops.
Where it fits
Before committing to a product direction, you rigor-review whether benchmark claims in your ARA actually entail the market narrative.
After an agent drafts a technical report with experiments, you run dimension scoring to fix weak relevance matches before docs ship.
Pre-launch, you treat the ARA like a review artifact and clear major findings on evidence type mismatches.
How it compares
Use instead of a single-pass “sounds good” chat review when you need dimension-scored epistemic QA on research artifacts.
Common Questions / FAQ
Who is ara-rigor-reviewer for?
Indie builders and agent users who produce structured research outputs (ARA) and need a second-pass rigor review over claims, evidence, and falsifiability—not just formatting.
When should I use ara-rigor-reviewer?
During Validate when scoping whether evidence supports a thesis, during Build when tightening docs or agent-generated research packs, and during Ship review before publishing or productizing conclusions.
Is ara-rigor-reviewer safe to install?
Review the Security Audits panel on this Prism page for this skill’s source repo; the skill itself is a read-and-reason review workflow with no implied destructive ops in the excerpted spec.
SKILL.md
# Level 2 Review Dimensions: Scoring Anchors and Check Inventory

Six dimensions of epistemic quality. All checks are semantic: they require reading comprehension and reasoning over the ARA's content. Structural validation (reference resolution, field presence, YAML parsing) is handled entirely by Level 1.

---

## D1. Evidence Relevance

**Question**: Does the cited evidence actually support each claim in substance, not just by reference?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Relevance | Experiment's Setup/Procedure addresses what the claim actually asserts | major |
| Type-aware entailment | Experiment design matches claim type (causal→ablation, generalization→heterogeneous, improvement→baseline, descriptive→sampling, scoping→bounds) | major |
| Evidence sufficiency | Is a single experiment enough to support this claim, or are multiple needed? | suggestion |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | Type-appropriate, relevant evidence for every claim; multi-experiment support where needed |
| 4 | Evidence relevant for all claims, minor type mismatches |
| 3 | Most claim-experiment pairs relevant, 1-2 weak matches |
| 2 | Multiple claims where cited experiments don't substantively address the claim |
| 1 | Majority of claims cite experiments irrelevant to their statements |

---

## D2. Falsifiability Quality

**Question**: Are claims genuinely falsifiable with meaningful, actionable criteria?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Actionability | Could an independent researcher execute this? Specifies what to measure, failure threshold, and conditions? | major |
| Non-triviality | Is the criterion more than a tautology? ("If the method doesn't work" = trivial) | major |
| Scope match | Does the criterion address the same scope as the Statement? | major |
| Independence | Could it be tested without proprietary data or systems? | minor |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | Every claim has specific, actionable, independently testable criteria matching claim scope |
| 4 | Most criteria are strong, 1-2 vague or hard to operationalize |
| 3 | Mixed; some actionable, some trivial or scope-mismatched |
| 2 | Most criteria trivial, tautological, or scope-mismatched |
| 1 | Criteria meaningless across claims |

---

## D3. Scope Calibration

**Question**: Do claims assert exactly what their evidence supports, no more and no less?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Over-claiming | Statement uses universal scope while evidence covers narrow conditions | critical if extreme, major if moderate |
| Under-claiming | Evidence files or experiment results not captured by any claim | minor |
| Assumption explicitness | Key assumptions stated in problem.md or constraints.md | major if unstated assumptions affect validity |
| Generalization boundaries | Artifact states what claims do NOT apply to | minor |
| Qualifier consistency | Hedging language matches evidence strength | minor |

### Scoring Anchors

| Score | Description |
|-------|-------------|
| 5 | All claims precisely match evidence scope, assumptions explicit, limits stated |
| 4 | Well-scoped with minor gaps in assumption documentation |
| 3 | Some claims slightly over/under-reach, assumptions partially stated |
| 2 | Multiple over-claims or significant undocumented assumptions |
| 1 | Pervasive scope mismatch between claims and evidence |

---

## D4. Argument Coherence

**Question**: Does the argument follow a coherent path from problem to solution to evidence?

### Checks

| Check | What to verify | Finding severity |
|-------|---------------|-----------------|
| Observation → Gap derivation | Gaps follow logically from observations | major |
| Gap → Insight connection | Key insight addresses the identified ga