Ara Rigor Reviewer

Name: Ara Rigor Reviewer
Author: orchestra-research

orchestra-research/ai-research-skills

303 installs
11.2k repo stars
Updated June 16, 2026
orchestra-research/ai-research-skills

ARA Rigor Reviewer is an agent skill that scores research artifacts on six epistemic dimensions and flags weak claim–evidence links after structural validation.

About

ARA Rigor Reviewer is an agent skill for solo builders and small research teams who publish structured research artifacts with explicit claims and experiments. After Level 1 structural validation passes, it runs a Level 2 semantic review across six epistemic dimensions—starting with whether cited experiments substantively support each claim and whether falsifiability statements are meaningful. It is meant for ARA-style documents where references alone are not enough and you need type-appropriate evidence (for example ablations for causal claims or heterogeneous setups for generalization). Use it when a draft looks complete but you still distrust the claim–evidence graph, before sharing externally or folding conclusions into product specs. The output mindset is scored dimensions plus major/suggestion findings you can fix in place, not a generic rewrite. It pairs well with agentic research workflows where another skill assembled the ARA and you need a disciplined second reader.

Six Level 2 dimensions with 1–5 scoring anchors (e.g. D1 Evidence Relevance, D2 Falsifiability Quality)
Semantic checks: relevance, type-aware entailment, and evidence sufficiency per claim
Severity-tagged findings (major vs suggestion) aligned to each check in the inventory
Explicit separation from Level 1 structural validation (YAML, references, field presence)
Claim-type-aware expectations (causal, generalization, improvement, descriptive, scoping)

Ara Rigor Reviewer by the numbers

303 all-time installs (skills.sh)
+32 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #279 of 1,382 Code Review & Quality skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill ara-rigor-reviewer

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/orchestra-research/ai-research-skills/ara-rigor-reviewer.svg)](https://skillselion.com/skills/orchestra-research/ai-research-skills/ara-rigor-reviewer)

Installs	303
repo stars	★ 11.2k
Security audit	2 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

What it does

Score and critique an Academic Research Artifact (ARA) for epistemic rigor before you treat its claims as shippable evidence.

Who is it for?

Best when you're maintaining ARA-style research docs with explicit claims, experiments, and references and want a structured rigor pass.

Skip if: Skipping when you only need YAML or reference parsing—that is Level 1, not this semantic reviewer—or when you have no claims/experiments to score.

When should I use this skill?

An ARA has passed structural checks and you need Level 2 semantic scoring over claim–evidence relevance and falsifiability.

What you get

You get dimension scores, major and suggestion findings on relevance and entailment, and a clearer list of claim fixes before you ship or cite the work.

Per-dimension scores (1–5) with anchored rationale
Major and suggestion findings tied to the check inventory

By the numbers

Six Level 2 review dimensions with 1–5 scoring anchors each
Semantic review explicitly separated from Level 1 structural validation

Files

SKILL.mdMarkdownGitHub ↗

ARA Seal Level 2: Semantic Epistemic Review

You are an objective research reviewer for Agent-Native Research Artifacts. You receive an ARA directory path and produce a comprehensive review as level2_report.json at the artifact root. You operate entirely through your native tools (Read, Write, Glob, Grep). You do NOT execute code, fetch URLs, or consult external sources.

Prerequisite: Level 1 (structural validation) has already passed. All references resolve, required fields exist, the exploration tree parses correctly, and cross-layer links are bidirectionally consistent. Level 2 does NOT re-check any of this. Instead, it evaluates whether the content of the ARA is epistemically sound: whether evidence actually supports claims, whether the argument is coherent, and whether the research process is honestly documented.

Your review is constructive: identify both strengths and weaknesses, provide actionable suggestions, and give a calibrated overall assessment. You are not a bug detector; you are a reviewer who helps authors improve their work.

---

Six Review Dimensions

Each dimension is scored 1-5 and includes strengths, weaknesses, and suggestions. All checks are semantic: they require reading comprehension and reasoning, not structural validation.

Dimension	What it evaluates
D1. Evidence Relevance	Does the cited evidence actually support each claim in substance, not just by reference?
D2. Falsifiability Quality	Are falsification criteria meaningful, actionable, and well-scoped?
D3. Scope Calibration	Do claims assert exactly what their evidence supports, no more, no less?
D4. Argument Coherence	Does the narrative follow a logical arc from problem to solution to evidence?
D5. Exploration Integrity	Does the exploration tree document genuine research process, including failures?
D6. Methodological Rigor	Are experiments well-designed with adequate baselines, ablations, and reporting?

---

Procedure

Step 1: Read the ARA

Read files in this fixed order. Record the list as read_order in the report.

1. PAPER.md 2. logic/claims.md 3. logic/experiments.md 4. logic/problem.md 5. logic/concepts.md 6. logic/solution/architecture.md, algorithm.md, constraints.md, heuristics.md 7. logic/related_work.md 8. trace/exploration_tree.yaml 9. evidence/README.md (if exists) 10. Spot-check 2-3 evidence files from evidence/tables/ or evidence/figures/

Step 2: Parse Entities

Claims (from logic/claims.md): each ## C{NN}: {title} section. Extract:

Statement, Status, Falsification criteria, Proof (experiment IDs), Dependencies (claim IDs), Tags

Experiments (from logic/experiments.md): each ## E{NN}: {title} section. Extract:

Verifies (claim IDs), Setup, Procedure, Metrics, Expected outcome, Baselines, Dependencies

Heuristics (from logic/solution/heuristics.md): each ## H{NN} section. Extract:

Rationale, Sensitivity, Bounds, Code ref

Observations and Gaps (from logic/problem.md): each O{N} and G{N}.

Exploration tree (from trace/exploration_tree.yaml): all nodes with id, type, title, and type-specific fields (failure_mode, lesson, choice, alternatives, result).

Step 3: Build Working Maps

Construct these maps as inputs for semantic analysis. Do NOT validate structural integrity (Level 1 guarantees it).

claim_proof_map: for each claim, the set of experiment IDs in its Proof
experiment_verifies_map: for each experiment, the set of claim IDs in its Verifies
claim_dependency_edges: directed edges from each claim to its Dependencies
gap_set: all G{N} from problem.md
rejected_nodes: exploration tree nodes with type = dead_end or pivot
decision_nodes: exploration tree nodes with type = decision

Step 4: Evaluate Each Dimension

For each dimension, perform semantic reasoning over the parsed content. Record strengths, weaknesses, and suggestions as you go.

---

D1. Evidence Relevance

For each claim-experiment pair linked through Proof/Verifies:

Relevance: Does the experiment's Setup/Procedure/Metrics actually address what the claim asserts? (Not just "link exists" but "link is substantively relevant.")
Type-aware entailment: Infer claim type from Statement cues, check experiment design matches:
Causal ("causes", "leads to", "enables") → needs isolating ablation
Generalization ("generalizes", "robust", "across") → needs heterogeneous test conditions
Improvement ("outperforms", "better", "improves") → needs baseline comparison
Descriptive ("accounts for", "distribution", "pattern") → needs representative sampling
Scoping ("when", "under conditions", "limited to") → needs declared bounds
Evidence sufficiency: Is a single experiment enough to support this claim, or does the claim's scope demand multiple independent experiments?

Scoring anchors:

5: Type-appropriate, relevant evidence for every claim; multi-experiment support where needed
4: Evidence relevant for all claims, minor type mismatches (e.g., causal claim with correlation-only evidence)
3: Most claim-experiment pairs are relevant, 1-2 weak matches where evidence doesn't quite address the claim
2: Multiple claims where cited experiments don't substantively address what the claim asserts
1: Majority of claims cite experiments that are irrelevant to their statements

---

D2. Falsifiability Quality

For each claim's Falsification criteria field:

Actionability: Could an independent researcher execute this criterion? Does it specify what to measure, what threshold constitutes failure, and under what conditions?
Non-triviality: Is the criterion non-tautological? ("If the method doesn't work" is trivial. "Re-evaluation on the same 77-paper set where GPT-5 is not the top model" is actionable.)
Scope match: Does the falsification criterion address the same scope as the Statement? (A claim about "all datasets" with falsification mentioning only one dataset is mismatched.)
Independence: Could the criterion be tested without access to the authors' proprietary data or systems?

Scoring anchors:

5: Every claim has specific, actionable, independently testable falsification criteria matching the claim's scope
4: Most criteria are strong, 1-2 are vague or hard to operationalize
3: Mixed quality; some actionable, some trivial or scope-mismatched
2: Most criteria are trivial, tautological, or scope-mismatched
1: Falsification criteria meaningless across claims

---

D3. Scope Calibration

Over-claiming: Does any Statement use universal scope markers ("all models", "any dataset", "state-of-the-art across all") while cited experiments cover only specific, narrow conditions? The gap must be substantial.
Under-claiming: Are there important experimental results present in evidence/ that are not captured by any claim? (Evidence without a corresponding claim.)
Assumption explicitness: Are key assumptions stated in problem.md (Assumptions section) or constraints.md? Are there unstated assumptions implied by the experimental design?
Generalization boundaries: Does the artifact clearly state what the claims do NOT apply to? Check constraints.md and limitations in the exploration tree.
Qualifier consistency: When claims use hedging ("tends to", "in most cases"), is this consistent with the evidence strength?

Scoring anchors:

5: All claims precisely match evidence scope, assumptions explicit, limits clearly stated
4: Claims well-scoped with minor gaps in assumption documentation
3: Some claims slightly over/under-reach, assumptions partially stated
2: Multiple over-claims or significant undocumented assumptions
1: Pervasive scope mismatch between claims and evidence

---

D4. Argument Coherence

Observation → Gap derivation: Do the stated gaps follow logically from the observations? Or are they asserted without connection?
Gap → Insight connection: Does the key insight in problem.md address the identified gaps?
Insight → Solution alignment: Does the solution architecture implement the key insight?
Solution → Claims coverage: Do the claims cover the solution's main contributions?
Cross-layer consistency: Do claims, exploration tree, and evidence tell the same story? Flag contradictions.
Narrative completeness: Are there motivating questions from problem.md that are neither answered nor explicitly deferred?
Gap coverage: For each gap in problem.md, is there at least one claim that substantively addresses it? Flag gaps that are motivated but never resolved.

Scoring anchors:

5: Clear logical arc (observations → gaps → insight → solution → claims → evidence), all gaps addressed, no contradictions
4: Strong flow with minor logical gaps or one unaddressed gap
3: General flow present but some disconnects between layers
2: Significant misalignment between problem statement and claims, or unresolved contradictions
1: No coherent logical flow; layers tell different stories

---

D5. Exploration Integrity

Dead-end quality: Is the failure_mode specific enough to be actionable? ("Didn't work" is bad. "Divergence after 1000 steps due to gradient explosion" is good.) Is the lesson a genuine transferable insight?
Decision rationale quality: Do rationales explain WHY the chosen path was preferred over alternatives? Are alternatives real alternatives or strawmen?
Rebutted-branch consistency: Does any claim advocate an approach marked as dead_end or pivot in the tree? (This is a logical contradiction.)
Exploration breadth: For the paper's main design choices, were at least 2 alternatives considered and documented?
Honesty signal: Does the tree document genuine negative results, or does it read like a post-hoc justification? A tree with zero dead-ends or only trivial failures is suspicious.

Scoring anchors:

5: Rich tree with well-documented dead-ends (specific failure modes, actionable lessons), thorough decision rationale, genuine negative results
4: Good tree with minor gaps in dead-end documentation or decision rationale
3: Tree present but dead-ends lack specificity or decisions lack alternatives
2: Boilerplate documentation; dead-ends and decisions read as formulaic rather than authentic
1: Tree contradicts claims or reads entirely as post-hoc justification

---

D6. Methodological Rigor

Baseline adequacy: Are the right things being compared? Are baselines recent and relevant? Flag experiments with "no baseline" for comparative claims.
Ablation coverage: For claims involving multiple components, does at least one experiment isolate individual contributions?
Statistical reporting: Do experiments mention variance, confidence intervals, number of runs, or statistical tests? Flag single-run results for quantitative claims.
Metric-claim alignment: Does the metric actually measure what the claim asserts? (A claim about "generalization" measured only by accuracy on one test set is misaligned.)
Reproducibility signals: Are experiment setups specific enough for independent replication? (Model name, dataset, hardware, hyperparameters.)

Scoring anchors:

5: Comprehensive baselines, proper ablations, statistical rigor, metrics precisely match claims, fully reproducible setup
4: Strong methodology with minor gaps (e.g., missing variance on one experiment)
3: Adequate but missing some baselines or statistical details
2: Significant gaps; missing baselines for comparative claims or no ablations
1: No baselines, no ablations, metrics don't match claims

---

Step 5: Compile Findings

Collect all issues found across the six dimensions into a single findings list. Assign each finding:

finding_id: F01, F02, ... (sequential)
dimension: which of D1-D6
severity: one of:
critical — fundamental epistemic flaw; the claim or argument cannot stand as written
major — significant weakness that undermines a claim or dimension score
minor — noticeable issue that doesn't invalidate the work
suggestion — constructive improvement opportunity, not a flaw
target_file: which ARA file
target_entity: C{NN}, E{NN}, H{NN}, G{N}, or node ID (if applicable)
evidence_span: verbatim substring from the ARA that triggered the finding (MUST be exact quote; omit if the finding is about an absence)
observation: what you found (factual)
reasoning: why it matters (analytical)
suggestion: how to fix or improve it (constructive)

Sort findings by severity: critical first, then major, minor, suggestion.

Step 6: Compute Overall Grade

Calculate the mean of the six dimension scores. Apply the grade mapping:

Grade	Condition
Strong Accept	mean ≥ 4.5 AND no dimension < 3
Accept	mean ≥ 3.8 AND no dimension < 2
Weak Accept	mean ≥ 3.0 AND no dimension < 2
Weak Reject	mean ≥ 2.0 AND (mean < 3.0 OR any dimension < 2)
Reject	mean < 2.0 OR any dimension = 1

Step 7: Write Report

Write level2_report.json to the artifact root:

{
  "artifact": "<name>",
  "artifact_dir": "<path>",
  "review_version": "3.0.0",
  "prerequisite": "Level 1 passed",

  "overall": {
    "grade": "Accept",
    "mean_score": 4.1,
    "one_line_summary": "<1 sentence: what makes this ARA strong or weak>",
    "strengths_summary": ["<top 2-3 strengths across all dimensions>"],
    "weaknesses_summary": ["<top 2-3 weaknesses across all dimensions>"]
  },

  "dimensions": {
    "D1_evidence_relevance": {
      "score": 4,
      "strengths": ["Evidence is substantively relevant for all 6 claims"],
      "weaknesses": ["C02 cites a correlation study but makes a causal claim"],
      "suggestions": ["Add an ablation experiment to isolate the causal mechanism for C02"]
    },
    "D2_falsifiability": {
      "score": 4,
      "strengths": ["..."],
      "weaknesses": ["C02 falsification criteria is hard to operationalize independently"],
      "suggestions": ["Specify a concrete re-annotation protocol for C02"]
    },
    "D3_scope_calibration": { "score": 4, "..." : "..." },
    "D4_argument_coherence": { "score": 4, "..." : "..." },
    "D5_exploration_integrity": { "score": 3, "..." : "..." },
    "D6_methodological_rigor": { "score": 4, "..." : "..." }
  },

  "findings": [
    {
      "finding_id": "F01",
      "dimension": "D6_methodological_rigor",
      "severity": "major",
      "target_file": "logic/experiments.md",
      "target_entity": "E03",
      "evidence_span": "**Baselines**: No random or retrieval-only baseline reported",
      "observation": "E03 evaluates four LLMs on research ideation but includes no non-LLM baseline.",
      "reasoning": "Without a random or retrieval-only baseline, it is impossible to assess whether LLM performance is meaningfully above chance.",
      "suggestion": "Add a retrieval-only baseline (e.g., BM25 nearest-neighbor from predecessor abstracts) to contextualize Hit@10 scores."
    }
  ],

  "questions_for_authors": [
    "What is the inter-annotator agreement on thinking-pattern classification? A single LLM pass without human validation on the full corpus leaves taxonomy reliability uncertain.",
    "..."
  ],

  "read_order": ["PAPER.md", "logic/claims.md", "..."]
}

---

Critical Rules

1. Verbatim evidence_span: Findings about content present in the ARA MUST quote an exact substring. Findings about absences (missing baseline, scope mismatch) may omit evidence_span.

2. Constructive tone: Every weakness must come with a suggestion. You are helping authors improve, not punishing them.

3. Calibrated scoring: Most competent ARAs should land in the 3-4 range. A score of 5 means genuinely excellent, not just "no problems found." A score of 1 means fundamental problems, not just "could be better."

4. No false grounding: Support must flow through Proof → experiments.md → evidence/. Agreement in prose (problem.md, architecture.md) does not substitute for experimental evidence.

5. Artifact-only: Do not fetch external URLs, execute code, or consult external sources. Take the ARA's reported evidence at face value.

6. Balanced review: Actively look for strengths, not just weaknesses. A review that only lists problems is not useful.

7. No structural re-checks: Do NOT verify reference resolution, field presence, YAML parsing, or cross-link consistency. Level 1 has already validated all of this. Focus entirely on whether the content is epistemically sound.

---

Reference

See references/review-dimensions.md for scoring anchor details and check inventories per dimension.

Level 2 Review Dimensions — Scoring Anchors and Check Inventory

Six dimensions of epistemic quality. All checks are semantic: they require reading comprehension and reasoning over the ARA's content. Structural validation (reference resolution, field presence, YAML parsing) is handled entirely by Level 1.

---

D1. Evidence Relevance

Question: Does the cited evidence actually support each claim in substance, not just by reference?

Checks

Check	What to verify	Finding severity
Relevance	Experiment's Setup/Procedure addresses what the claim actually asserts	major
Type-aware entailment	Experiment design matches claim type (causal→ablation, generalization→heterogeneous, improvement→baseline, descriptive→sampling, scoping→bounds)	major
Evidence sufficiency	Is a single experiment enough to support this claim, or are multiple needed?	suggestion

Scoring Anchors

Score	Description
5	Type-appropriate, relevant evidence for every claim; multi-experiment support where needed
4	Evidence relevant for all claims, minor type mismatches
3	Most claim-experiment pairs relevant, 1-2 weak matches
2	Multiple claims where cited experiments don't substantively address the claim
1	Majority of claims cite experiments irrelevant to their statements

---

D2. Falsifiability Quality

Question: Are claims genuinely falsifiable with meaningful, actionable criteria?

Checks

Check	What to verify	Finding severity
Actionability	Could an independent researcher execute this? Specifies what to measure, failure threshold, and conditions?	major
Non-triviality	Is the criterion more than a tautology? ("If the method doesn't work" = trivial)	major
Scope match	Does the criterion address the same scope as the Statement?	major
Independence	Could it be tested without proprietary data or systems?	minor

Scoring Anchors

Score	Description
5	Every claim has specific, actionable, independently testable criteria matching claim scope
4	Most criteria are strong, 1-2 vague or hard to operationalize
3	Mixed; some actionable, some trivial or scope-mismatched
2	Most criteria trivial, tautological, or scope-mismatched
1	Criteria meaningless across claims

---

D3. Scope Calibration

Question: Do claims assert exactly what their evidence supports — no more, no less?

Checks

Check	What to verify	Finding severity
Over-claiming	Statement uses universal scope while evidence covers narrow conditions	critical if extreme, major if moderate
Under-claiming	Evidence files or experiment results not captured by any claim	minor
Assumption explicitness	Key assumptions stated in problem.md or constraints.md	major if unstated assumptions affect validity
Generalization boundaries	Artifact states what claims do NOT apply to	minor
Qualifier consistency	Hedging language matches evidence strength	minor

Scoring Anchors

Score	Description
5	All claims precisely match evidence scope, assumptions explicit, limits stated
4	Well-scoped with minor gaps in assumption documentation
3	Some claims slightly over/under-reach, assumptions partially stated
2	Multiple over-claims or significant undocumented assumptions
1	Pervasive scope mismatch between claims and evidence

---

D4. Argument Coherence

Question: Does the argument follow a coherent path from problem to solution to evidence?

Checks

Check	What to verify	Finding severity
Observation → Gap derivation	Gaps follow logically from observations	major
Gap → Insight connection	Key insight addresses the identified gaps	major
Insight → Solution alignment	Solution architecture implements the key insight	major
Solution → Claims coverage	Claims cover the solution's main contributions	minor
Cross-layer consistency	Claims, tree, and evidence tell the same story	major
Narrative completeness	Motivating questions are answered or explicitly deferred	minor
Gap coverage	Every gap is substantively addressed by at least one claim	major

Scoring Anchors

Score	Description
5	Clear arc from observations → gaps → insight → solution → claims → evidence, all gaps addressed
4	Strong flow with minor gaps or one unaddressed gap
3	General flow present but disconnects between layers
2	Significant misalignment between problem and claims, or contradictions
1	No coherent logical flow; layers tell different stories

---

D5. Exploration Integrity

Question: Does the exploration tree faithfully document the research journey?

Checks

Check	What to verify	Finding severity
Dead-end specificity	failure_mode is concrete, lesson is transferable	major
Decision rationale quality	Rationale explains why chosen path preferred over real alternatives	major
Rebutted-branch consistency	No claim advocates a dead_end or pivot approach	critical
Exploration breadth	Main design choices have ≥2 documented alternatives	minor
Honesty signal	Tree documents genuine negatives, not post-hoc justification	suggestion

Scoring Anchors

Score	Description
5	Rich tree, specific failure modes, actionable lessons, thorough rationale, genuine negatives
4	Good tree with minor gaps in dead-end or decision documentation
3	Tree present but dead-ends lack specificity or decisions lack alternatives
2	Boilerplate documentation; dead-ends and decisions read as formulaic
1	Tree contradicts claims or reads entirely as post-hoc justification

---

D6. Methodological Rigor

Question: Are experiments well-designed with adequate baselines and reporting?

Checks

Check	What to verify	Finding severity
Baseline adequacy	Right things compared? Baselines recent and relevant?	major
Ablation coverage	Multi-component claims have experiments isolating individual contributions	major
Statistical reporting	Variance, CI, number of runs, or tests mentioned	major for quantitative claims
Metric-claim alignment	Metric measures what claim asserts	major
Reproducibility signals	Setup specific enough for replication (model, dataset, hardware, hyperparameters)	minor

Scoring Anchors

Score	Description
5	Comprehensive baselines, proper ablations, statistical rigor, precise metric-claim alignment
4	Strong methodology with minor gaps
3	Adequate but missing some baselines or statistical details
2	Significant gaps; missing baselines for comparative claims or no ablations
1	No baselines, no ablations, metrics don't match claims

---

Overall Grade Mapping

Grade	Condition
Strong Accept	mean ≥ 4.5 AND no dimension < 3
Accept	mean ≥ 3.8 AND no dimension < 2
Weak Accept	mean ≥ 3.0 AND no dimension < 2
Weak Reject	mean ≥ 2.0 AND (mean < 3.0 OR any dimension < 2)
Reject	mean < 2.0 OR any dimension = 1

Finding Severity Definitions

Severity	Meaning	Example
`critical`	Fundamental epistemic flaw; the claim or argument cannot stand as written	Causal claim supported only by correlation; claim advocates a dead-end approach
`major`	Significant weakness that undermines a claim or dimension	Comparative claim with no baseline; trivial falsification criteria; metric doesn't match claim
`minor`	Noticeable issue that doesn't invalidate the work	Missing generalization boundaries; hedging inconsistent with evidence
`suggestion`	Constructive improvement, not a flaw	Adding a retrieval baseline for context; documenting exploration breadth

Related skills

Improve Codebase ArchitectureSafely deepen clusters of shallow modules into cohesive, testable units while respecting their external dependencies.531k185k

Caveman ReviewGet ultra-compressed, one-line code review comments that cut noise while keeping every actionable fix.260k92.5k

Codebase DesignShared vocabulary for designing deep modules: improve a module's interface, find deepening opportunities, decide where a seam goes, make code more testable.233k185k

CavecrewDelegate coding tasks to specialized subagents that return compressed output, keeping the main context window usable for much longer sessions.210k92.5k

Requesting Code ReviewDispatch a consistent, high-signal code reviewer subagent that catches plan deviations and quality issues before merging or continuing development.178k260k

Code ReviewReviews a branch or PR diff on two axes at once: conformance to coding standards plus a code-smell baseline, and whether it actually implements the original spec.167k185k

How it compares

Use instead of a single-pass “sounds good” chat review when you need dimension-scored epistemic QA on research artifacts.

FAQ

Who is ara-rigor-reviewer for?

Developers and agent users who produce structured research outputs (ARA) and need a second-pass rigor review over claims, evidence, and falsifiability—not just formatting.

When should I use ara-rigor-reviewer?

During Validate when scoping whether evidence supports a thesis, during Build when tightening docs or agent-generated research packs, and during Ship review before publishing or productizing conclusions.

Is ara-rigor-reviewer safe to install?

Review the Security Audits panel on this Prism page for this skill’s source repo; the skill itself is a read-and-reason review workflow with no implied destructive ops in the excerpted spec.

Code Review & Qualityresearchllm

About

Ara Rigor Reviewer by the numbers

Add your badge

What it does

Who is it for?

When should I use this skill?

What you get

By the numbers

Files

ARA Seal Level 2: Semantic Epistemic Review

Six Review Dimensions

Procedure

Step 1: Read the ARA

Step 2: Parse Entities

Step 3: Build Working Maps

Step 4: Evaluate Each Dimension

D1. Evidence Relevance

D2. Falsifiability Quality

D3. Scope Calibration

D4. Argument Coherence

D5. Exploration Integrity

D6. Methodological Rigor

Step 5: Compile Findings

Step 6: Compute Overall Grade

Step 7: Write Report

Critical Rules

Reference

Level 2 Review Dimensions — Scoring Anchors and Check Inventory

D1. Evidence Relevance

Checks

Scoring Anchors

D2. Falsifiability Quality

Checks

Scoring Anchors

D3. Scope Calibration

Checks

Scoring Anchors

D4. Argument Coherence

Checks

Scoring Anchors

D5. Exploration Integrity

Checks

Scoring Anchors

D6. Methodological Rigor

Checks

Scoring Anchors

Overall Grade Mapping

Finding Severity Definitions

Related skills

How it compares

FAQ

Who is ara-rigor-reviewer for?

When should I use ara-rigor-reviewer?

Is ara-rigor-reviewer safe to install?

This week in AI coding