
Peer Review
Run structured peer review on scientific manuscripts and preprints to catch statistical misuse, wrong tests, and reporting gaps before submission or publication.
Overview
Peer review is an agent skill most often used in Ship (also Validate, Grow) that surfaces statistical and methodological red flags in scientific manuscripts so solo builders can revise before submission.
Install
npx skills add https://github.com/davila7/claude-code-templates --skill peer-review
What is this skill?
- Catalogs p-value misuse, multiple-testing omissions, and missing effect sizes with concrete identification cues.
- Flags inappropriate parametric tests and assumption violations common in biology and social-science papers.
- Organizes feedback by statistical, design, and reporting categories for constructive reviewer comments.
- Recommends corrections such as FDR/Bonferroni, CIs, and cautious language for non-significant results.
- Reference document for agents emulating methodological peer review, not a journal submission portal.
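The corrections named in the list above (Bonferroni, Holm-Bonferroni, and Benjamini-Hochberg FDR) can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not code shipped with the skill: the function names and the example p-values are assumptions made up for demonstration.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values (controls the familywise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        idx = order[rank]
        candidate = p_values[idx] * m / (rank + 1)
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values from four tests in one manuscript (made up for illustration).
    raw = [0.01, 0.04, 0.03, 0.005]
    for name, fn in [("Bonferroni", bonferroni), ("Holm", holm), ("BH", benjamini_hochberg)]:
        adj = fn(raw)
        survivors = sum(p < 0.05 for p in adj)
        print(f"{name}: {survivors} of {len(raw)} tests remain below 0.05")
```

Comparing adjusted p-values against the manuscript's stated threshold shows which reported findings would survive each correction. Bonferroni and Holm control the familywise error rate, while Benjamini-Hochberg controls the false discovery rate and is usually less conservative.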
Adoption & trust: 553 installs on skills.sh; 27.8k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a draft paper or technical report and no statistician on call to spot p-hacking, wrong tests, or overclaimed null results.
Who is it for?
Indie researchers, ML practitioners writing eval papers, and founders publishing whitepapers who want agent-assisted methodological critique.
Skip if: you need a replacement for human peer review, IRB/ethics sign-off, or primary statistical analysis of raw study data.
When should I use this skill?
When reviewing scientific manuscripts, preprints, or technical reports for statistical and methodological quality before publication or major revision.
What do I get? / Deliverables
You receive categorized review comments and remediation suggestions you can turn into revisions or a response letter before you publish or distribute the work.
- Categorized issue list with severity-style grouping (statistical, design, reporting)
- Actionable reviewer comments and recommended statistical reporting fixes
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Formal critique of methods and claims maps naturally to Ship review—the last quality gate before a manuscript or report goes public. The skill is a review checklist and issue catalog, not a drafting or analysis tool, so it sits on the review subphase shelf first.
Where it fits
Validate: Stress-test whether planned analyses and sample-size claims in a study outline will survive reviewer scrutiny before you run expensive experiments.
Ship: Generate structured comments on p-values, corrections, and effect sizes in a near-final PDF before arXiv or journal submission.
Grow: Re-audit a public technical blog or report after adding new results so outdated significance language does not mislead readers.
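A sample-size claim in a study outline can be sanity-checked with a back-of-envelope calculation. The sketch below uses the standard normal approximation for a two-sided, two-sample comparison of means, n = 2(z_{1-alpha/2} + z_{power})^2 / d^2. The function name `n_per_group` is hypothetical, and the approximation slightly underestimates what an exact t-test calculation would require, so treat it as a rough check rather than a substitute for a proper power analysis.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group (normal approximation).

    effect_size is the standardized mean difference (Cohen's d) the study
    expects to detect; alpha is the two-sided significance level.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = std_normal.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    # Conventional small / medium / large effect sizes, used here only as examples.
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: about {n_per_group(d)} per group")
```

If an outline plans far fewer participants per group than this estimate for its expected effect size, a reviewer is likely to question the design's power.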
How it compares
A manuscript methodology checker skill, not a general code-review or lint skill for application repos.
Common Questions / FAQ
Who is peer-review for?
Solo builders and small teams preparing scientific or technical manuscripts who want systematic statistical and methods feedback from an agent.
When should I use peer-review?
In Validate while tightening study claims and scope; in Ship before submitting or posting a preprint; in Grow when updating public research content for accuracy.
Is peer-review safe to install?
Treat uploaded manuscripts as sensitive intellectual property; check the Security Audits panel on this page and avoid sending confidential data to untrusted hosts.
SKILL.md
# Common Methodological and Statistical Issues in Scientific Manuscripts

This document catalogs frequent issues encountered during peer review, organized by category. Use this as a reference to identify potential problems and provide constructive feedback.

## Statistical Issues

### 1. P-Value Misuse and Misinterpretation

**Common Problems:**
- P-hacking (selective reporting of significant results)
- Multiple testing without correction (familywise error rate inflation)
- Interpreting non-significance as proof of no effect
- Focusing exclusively on p-values without effect sizes
- Dichotomizing continuous p-values at arbitrary thresholds (p=0.049 vs p=0.051)
- Confusing statistical significance with biological/clinical significance

**How to Identify:**
- Suspiciously high proportion of p-values just below 0.05
- Many tests performed but no correction mentioned
- Statements like "no difference was found" from non-significant results
- No effect sizes or confidence intervals reported
- Language suggesting p-values indicate strength of effect

**What to Recommend:**
- Report effect sizes with confidence intervals
- Apply appropriate multiple testing corrections (Bonferroni, FDR, Holm-Bonferroni)
- Interpret non-significance cautiously (lack of evidence ≠ evidence of lack)
- Pre-register analyses to avoid p-hacking
- Consider equivalence testing for "no difference" claims

### 2. Inappropriate Statistical Tests

**Common Problems:**
- Using parametric tests when assumptions are violated (non-normal data, unequal variances)
- Analyzing paired data with unpaired tests
- Using t-tests for multiple groups instead of ANOVA with post-hoc tests
- Treating ordinal data as continuous
- Ignoring repeated measures structure
- Using correlation when regression is more appropriate

**How to Identify:**
- No mention of assumption checking
- Small sample sizes with parametric tests
- Multiple pairwise t-tests instead of ANOVA
- Likert scales analyzed with t-tests
- Time-series data analyzed without accounting for repeated measures

**What to Recommend:**
- Check assumptions explicitly (normality tests, Q-Q plots)
- Use non-parametric alternatives when appropriate
- Apply proper corrections for multiple comparisons after ANOVA
- Use mixed-effects models for repeated measures
- Consider ordinal regression for ordinal outcomes

### 3. Sample Size and Power Issues

**Common Problems:**
- No sample size justification or power calculation
- Underpowered studies claiming "no effect"
- Post-hoc power calculations (which are uninformative)
- Stopping rules not pre-specified
- Unequal group sizes without justification

**How to Identify:**
- Small sample sizes (n<30 per group for typical designs)
- No mention of power analysis in methods
- Statements about post-hoc power
- Wide confidence intervals suggesting imprecision
- Claims of "no effect" with large p-values and small n

**What to Recommend:**
- Conduct a priori power analysis based on expected effect size
- Report achieved power or precision (confidence interval width)
- Acknowledge when studies are underpowered
- Consider effect sizes and confidence intervals for interpretation
- Pre-register sample size and stopping rules

### 4. Missing Data Problems

**Common Problems:**
- Complete case analysis without justification (listwise deletion)
- Not reporting extent or pattern of missingness
- Assuming data are missing completely at random (MCAR) without testing
- Inappropriate imputation methods
- Not performing sensitivity analyses

**How to Identify:**
- Different n values across analyses without explanation
- No discussion of missing data
- Participants "excluded from analysis"
- Simple mean imputation used
- No sensitivity analyses comparing complete vs. imputed data

**What to Recommend:**
- Report extent and patterns of missingness
- Test MCAR assumption (Little's test)
- Use appropriate methods (multiple imputation, maximum likelihood)
- Perform sensitivity analyses
- Consider intention-to-t